Abstract: In this work we are interested in the problem of scheduling and redistributing data 
on master-slave platforms. We consider the case where the workers possess initial loads, some of 
which have to be redistributed in order to balance their completion times. 

We examine two different scenarios. The first model assumes that the data consists of 
independent and identical tasks. We prove NP-completeness in the strong sense for the general 
case, and we present two optimal algorithms for special platform types. Furthermore, we propose 
three heuristics for the general case. Simulations corroborate the theoretical results. 

The second data model is based on Divisible Load Theory. This problem can be solved in 
polynomial time by a combination of linear programming and simple analytical manipulations. 

Key-words: Master-slave platform, scheduling, data redistribution, one-port model, 
independent tasks, divisible load theory. 

Scheduling and data redistribution strategies on star platforms 

Résumé (translated from the French): In this work we address the problem of scheduling and 
redistributing data on master-slave platforms. We consider the case where the slaves hold 
initial data, some of which must be redistributed to balance their completion dates. 

We examine two different scenarios. The first model assumes that the data are identical 
independent tasks. We prove NP-completeness in the strong sense for the general case, and we 
present two algorithms for special platforms. In addition, we propose three heuristics for the 
general case. Experimental results obtained by simulation support the theoretical results. 

Mots-clés (translated): Master-slave platform, scheduling, load balancing, one-port model, 
independent tasks, divisible tasks. 

1 Introduction 

In this work we consider the problem of scheduling and redistributing data on master-slave 
architectures in star topologies. Because of variations in the resource performance (CPU speed or 
communication bandwidth), or because of unbalanced amounts of current load on the workers, 
data must be redistributed between the participating processors, so that the updated load is better 
balanced and the overall processing finishes earlier. 

We adopt the following abstract view of our problem. There are $m + 1$ participating processors 
$P_0, P_1, \ldots, P_m$, where $P_0$ is the master. Each processor $P_k$, $1 \le k \le m$, initially 
holds $L_k$ data items. During our scheduling process we try to determine which processor $P_i$ 
should send some data to another worker $P_j$ to equilibrate their finishing times. The goal is to 
minimize the global makespan, that is, the time until each processor has finished processing its 
data. Furthermore, we suppose that each communication link is fully bidirectional, with the same 
bandwidth for receptions and sendings. This assumption is quite realistic in practice, and does 
not change the complexity of the scheduling problem, which we prove NP-complete in the strong sense. 

We examine two different scenarios for the data items that are situated at the workers. The 
first model supposes that these data items consist of independent and uniform tasks, while the 
other model uses the Divisible Load Theory paradigm (DLT) [4]. 

The core of DLT is the following: DLT assumes that communication and computation loads 
can be fragmented into parts of arbitrary size and then distributed arbitrarily among different 
processors to be processed there. This corresponds to perfect parallel jobs: They can be split into 
arbitrary subtasks which can be processed in parallel in any order on any number of processors. 

Beaumont, Marchal, and Robert [2] treat the problem of divisible loads with return messages 
on heterogeneous master-worker platforms (star networks). In their framework, all the initial load 
is situated at the master and then has to be distributed to the workers. The workers compute their 
amount of load and return their results to the master. The difficulty of the problem is to decide 
about the sending order from the master and, at the same time, about the receiving order. In that 
paper, problems are formulated in terms of linear programs. Using this approach the authors were 
able to characterize optimal LIFO and FIFO strategies, whereas the general case is still open. 
Our problem is different, as in our case the initial load is already situated at the workers. To the 
best of our knowledge, we are the first to tackle this kind of problem. 

Having discussed the reasons and background of DLT, we dwell on the interest of the data 
model with uniform and independent tasks. Contrary to the DLT model, where the size of load 
can be diversified, the size of the tasks has to be fixed at the beginning. This leads to the first 
point of interest: when tasks have different sizes, the problem is NP-complete because of an 
obvious reduction to 2-partition [12]. The other point is a positive one: there exist many practical 
applications that use fixed identical and independent tasks. A famous example is BOINC [5], 
the Berkeley Open Infrastructure for Network Computing, an open-source software platform for 
volunteer computing. It works as a centralized scheduler that distributes tasks for participating 
applications. These projects consist of the treatment of computation-intensive and expensive 
scientific problems of multiple domains, such as biology, chemistry or mathematics. SETI@home [22] 
for example uses the accumulated computation power for the search of extraterrestrial intelligence. 
In the astrophysical domain, Einstein@home [11] searches for spinning neutron stars using data 
from the LIGO and GEO gravitational wave detectors. To get an idea of the task dimensions, in 
this project a task is about 12 MB and requires between 5 and 24 hours of dedicated computation. 

As already mentioned, we suppose that all data are initially situated on the workers, which 
leads us to a kind of redistribution problem. Existing redistribution algorithms have a different 
objective. Neither do they care how the degree of imbalance is determined, nor do they include 
the computation phase in their optimizations. They expect that a load-balancing algorithm has 
already taken place. With the help of these results, a redistribution algorithm determines the 
required communications and organizes them in minimal time. Renard, Robert, and Vivien present some 
optimal redistribution algorithms for heterogeneous processor rings in [20]. We could use this 
approach and redistribute the data first and then enter a computation phase. But our problem 
is more complicated as we suppose that communication and computation can overlap, i.e., every 
worker can start computing its initial data while the redistribution process takes place. 

To summarize our problem: as the participating workers are not equally charged and/or 
because of different resource performance, they might not finish their computation process at the 
same time. So we are looking for mechanisms to redistribute the loads in order to finish 
the global computation process in minimal time under the hypothesis that charged workers can 
compute at the same time as they communicate. 

The rest of this report is organized as follows: Section 2 presents some related work. The 
data model of independent and identical tasks is treated in Section 3: in Section 3.2 we discuss 
the case of general platforms. We are able to prove the NP-completeness for the general case 
of our problem, and the polynomiality for a restricted problem. The following sections consider 
some particular platforms: an optimal algorithm for homogeneous star networks is presented in 
Section 3.3, and Section 3.4 treats platforms with homogeneous communication links and 
heterogeneous workers. Some heuristics for heterogeneous platforms are presented 
in Section 3.5. Simulation results are shown in Section 4. Section 5 is devoted to the DLT 
model. We propose a linear program to solve the scheduling problem and propose formulas for 
the redistribution process. 

2 Related work 

Our work is principally related to three key topics. Since the early nineties Divisible Load 
Theory (DLT) has been recognized as an interesting method of distributing load in parallel 
computer systems. The outcome of DLT is a huge variety of scheduling strategies on how to 
distribute the independent parts to achieve maximal results. As the DLT model can be used on a 
vast variety of interconnection topologies like trees, buses, hypercubes and so on, theoretical 
and applicative elements are widely discussed in the literature. In his article Robertazzi gives 
Ten Reasons to Use Divisible Load Theory [21], like scalability or extending realism. Probing strategies 
[13] were shown to be able to handle unknown platform parameters. In [8] evaluations of efficiency 
of DLT are conducted. The authors analyzed the relation between the values of particular 
parameters and the efficiency of parallel computations. They demonstrated that several parameters 
in parallel systems are mutually related, i.e., the change of one of these parameters should be 
accompanied by changes of the other parameters to keep efficiency. The platform used in this 
article is a star network and the results are for applications with no return messages. Optimal 
scheduling algorithms including return messages are presented in [1]. The authors treat 
the problem of processing digital video sequences for digital TV and interactive multimedia. As a 
result, they propose two optimal algorithms for real time frame-by-frame processing. Scheduling 
problems with multiple sources are examined in [17]. The authors propose closed form solutions for 
tree networks with two load originating processors. 

Redistribution algorithms have also been well studied in the literature. Unfortunately, 
even simple redistribution problems are NP-complete [15]. For this reason, optimal algorithms 
can be designed only for particular cases, as is done in [20]. In their research, the authors 
restrict the platform architecture to ring topologies, both unidirectional and bidirectional. In the 
homogeneous case, they were able to prove optimality, but the heterogeneous case is still an open 
problem. In spite of this, other efficient algorithms have been proposed. For topologies like trees 
or hypercubes some results are presented in [25]. 

The load balancing problem is not directly dealt with in this paper. Nevertheless, we want 
to quote some key references on this subject, as the results of these algorithms are the starting 
point for the redistribution process. Generally, load balancing techniques can be classified into 
two categories: dynamic load balancing strategies and static load balancing. Dynamic techniques 
might use the past for the prediction of the future, as is the case in [7], or they suppose that the 
load varies permanently [14]. That is why for our problem static algorithms are more interesting: 
we are only treating star-platforms and, as the amount of load to be treated is known a priori, 
we do not need prediction. For homogeneous platforms, the papers in [23] survey existing results. 
Heterogeneous solutions are presented in [19] or [3]. This last paper is about a dynamic load 
balancing method for data parallel applications, called the working-manager method: the 
manager is supposed to use its idle time to process data itself. So the heuristic is simple: when 
the manager does not perform any control task it has to work, otherwise it schedules. 

3 Load balancing of independent tasks using the one-port 
bidirectional model 

3.1 Framework 

In this part we will work with a star network $S = \{P_0, P_1, \ldots, P_m\}$ shown in Figure 1. 
The processor $P_0$ is the master and the $m$ remaining processors $P_i$, $1 \le i \le m$, are 
workers. The initial data are distributed on the workers, so every worker $P_i$ possesses a number 
$L_i$ of initial tasks. All tasks are independent and identical. As we assume a linear cost model, 
each worker $P_i$ has a (relative) computing power $w_i$ for the computation of one task: it takes 
$X \cdot w_i$ time units to execute $X$ tasks on the worker $P_i$. The master $P_0$ can communicate 
with each worker $P_i$ via a communication link. A worker $P_i$ can send some tasks via the master 
to another worker $P_j$ to decrease its execution time. It takes $X \cdot c_i$ time units to send 
$X$ units of load from $P_i$ to $P_0$ and $X \cdot c_j$ time units to send these $X$ units from 
$P_0$ to a worker $P_j$. Without loss of generality we assume that the master is 
not computing, and only communicating. 
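
To make the linear cost model concrete, here is a minimal Python sketch. The class name `Worker`, the helper names and the numeric values are illustrative assumptions and do not come from the report; only the formulas $X \cdot w_i$ and $X \cdot c_i$ are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    load: int   # L_i: number of tasks initially held by the worker
    w: float    # w_i: time units needed to compute one task
    c: float    # c_i: time units needed to move one task over the link to/from the master


def compute_time(worker: Worker, x: int) -> float:
    """Time to execute x tasks on the worker (linear cost model: x * w_i)."""
    return x * worker.w


def link_time(worker: Worker, x: int) -> float:
    """Time to move x tasks between the worker and the master (x * c_i)."""
    return x * worker.c


def initial_makespan(workers) -> float:
    """Makespan if no task is redistributed: the slowest worker decides."""
    return max(compute_time(p, p.load) for p in workers)


# Hypothetical two-worker platform: p1 is overloaded, p2 is idle.
p1 = Worker(load=5, w=2.0, c=1.0)
p2 = Worker(load=0, w=3.0, c=2.0)
```

With these made-up values, moving two tasks from `p1` to `p2` occupies the link of `p1` for `link_time(p1, 2)` time units and the link of `p2` for `link_time(p2, 2)` time units.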




Figure 1: Example of a star network. 

The platforms dealt with in Sections 3.3 and 3.4 are a special case of a star network: all 
communication links have the same characteristics, i.e., $c_i = c$ for each processor $P_i$, 
$1 \le i \le m$. Such a platform is called a bus network as it has homogeneous communication links. 

We use the bidirectional one-port model for communication. This means that the master 
can only send data to, and receive data from, a single worker at a given time-step. But it can 
simultaneously receive one data item and send another. A given worker cannot start an execution 
before it has terminated the reception of the message from the master; similarly, it cannot start 
sending the results back to the master before finishing the computation. 

The objective function is to minimize the makespan, that is, the time at which all loads have 
been processed. So we look for a schedule $\sigma$ that accomplishes our objective. 

3.2 General platforms 

Using the notations and the platform topology introduced in Section 3.1, we now formally present 
the Scheduling Problem for Master-Slave Tasks on a Star of Heterogeneous 
Processors (SPMSTSHP). 

Figure 2: Platform parameters. 

Figure 3: Example of an optimal schedule on a heterogeneous platform, where a sending worker 
also receives a task. 



Definition 1 (SPMSTSHP). 

Let $N$ be a star-network with one special processor $P_0$ called "master" and $m$ workers. Let 
$n$ be the number of identical tasks distributed to the workers. For each worker $P_i$, let $w_i$ 
be the computation time for one task. Each communication link, $link_i$, has an associated 
communication time $c_i$ for the transmission of one task. Finally let $T$ be a deadline. 

The question associated to the decision problem of SPMSTSHP is: "Is it possible to redistribute 
the tasks and to process them in time $T$?". 

One of the main difficulties seems to be the fact that we cannot partition the workers into 
disjoint sets of senders and receivers. There exist situations where, to minimize the global 
makespan, it is useful that sending workers also receive tasks. (We will see later in this report 
that we can suppose this distinction when communications are homogeneous.) 

We consider the following example. We have four workers (see Figure 2 for their parameters) 
and a makespan fixed to $M = 12$. An optimal solution is shown in Figure 3: workers $P_3$ and $P_4$ 
do not own any task, and they are computing very slowly. So each of them can compute exactly one 
task. Worker $P_1$, which is a fast processor and communicator, sends them their tasks and receives 
later another task from worker $P_2$ that it can compute just in time. Note that worker $P_1$ is both 
sending and receiving tasks. If we try to solve the problem under the constraint that no worker 
both sends and receives, it is not feasible to achieve a makespan of 12. Worker $P_2$ has to send 
one task either to worker $P_3$ or to worker $P_4$. Sending and receiving this task takes 9 time units. 
Consequently the processing of this task cannot finish earlier than time $t = 18$. 

Another difficulty of the problem is the overlap of computation and the redistribution process. 
In the following we examine our problem neglecting the computations. We present an 
optimal polynomial algorithm for this problem and prove its optimality. 



3.2.1 Polynomiality when computations are neglected 

Examining our original problem under the supposition that computations are negligible, we get 
a classical data redistribution problem. Hence we eliminate the original difficulty of the overlap 
of computation with the data redistribution process. We suppose that we already know the 
imbalance of the system. So we adopt the following abstract view of our new problem: the $m$ 
participating workers $P_1, P_2, \ldots, P_m$ hold their initial uniform tasks $L_i$, $1 \le i \le m$. 
For a worker $P_i$ the chosen algorithm for the computation of the imbalance has decided that the 
new load should be $L_i - \delta_i$. If $\delta_i > 0$, this means that $P_i$ is overloaded and it 
has to send $\delta_i$ tasks to some other processors. If $\delta_i < 0$, $P_i$ is underloaded and it 
has to receive $-\delta_i$ tasks from other workers. We 
have heterogeneous communication links and all sent tasks pass by the master. So the goal is to 
determine the order of senders and receivers to redistribute the tasks in minimal time. 
As all communications pass by the master, workers cannot start receiving until tasks have 
arrived on the master. So to minimize the redistribution time, it is important to charge the master 
as fast as possible. Ordering the senders by non-decreasing $c_i$-values makes the tasks available 
at the earliest possible time. 

Suppose we would order the receivers in the same manner as the senders, i.e., by non-decreasing 
$c_i$-values. In this case we could start each reception as soon as possible, but always with the 
restriction that each task has to arrive first at the master (see Figure 4(b)). So it can happen that 
there are many idle times between the receptions if the tasks do not arrive in time on the master. 
That is why we choose to order the receivers in reversed order, i.e., by non-increasing $c_i$-values 
(cf. Figure 4(c)), to give the tasks more time to arrive. In the following theorem we even prove 
optimality of this ordering. 




Figure 4: Comparison of the ordering of the receivers. (a) Example of load imbalance on a 
heterogeneous platform with 4 workers. (b) The receivers are ordered by non-decreasing order of 
their $c_i$-values. (c) The receivers are ordered by non-increasing order of their $c_i$-values. 



Theorem 1. Knowing the imbalance $\delta_i$ of each processor, an optimal solution for 
heterogeneous star-platforms is to order the senders by non-decreasing $c_i$-values and the 
receivers by non-increasing order of $c_i$-values. 
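
The ordering rule can be checked on small instances with the following Python sketch. It is an illustration under assumptions, not the report's code: the function names are hypothetical, each list holds one link cost per redistributed task (a worker that moves several tasks appears several times), and the master forwards the k-th arrived task as the k-th outgoing one, which is harmless because tasks are identical.

```python
from itertools import permutations


def redistribution_time(send_costs, recv_costs):
    """Simulate the master under the bidirectional one-port model.

    send_costs[k] is the link cost of the sender of the k-th task,
    recv_costs[k] the link cost of the receiver of the k-th task.
    """
    arrival = 0    # date at which the current task is fully received by the master
    out_free = 0   # date at which the master's outgoing port becomes free
    for c_send, c_recv in zip(send_costs, recv_costs):
        arrival += c_send                  # incoming transfers are serialized
        start = max(arrival, out_free)     # task must be present and port must be free
        out_free = start + c_recv
    return out_free


def theorem_order_time(send_costs, recv_costs):
    """Senders by non-decreasing cost, receivers by non-increasing cost."""
    return redistribution_time(sorted(send_costs), sorted(recv_costs, reverse=True))


def brute_force_time(send_costs, recv_costs):
    """Best redistribution time over all sender and receiver orders (tiny inputs only)."""
    return min(
        redistribution_time(s, r)
        for s in permutations(send_costs)
        for r in permutations(recv_costs)
    )


# Made-up instance: three tasks to move.
senders = [1, 1, 2]
receivers = [3, 1, 1]
best = theorem_order_time(senders, receivers)
same_order = redistribution_time(sorted(senders), sorted(receivers))
```

On this instance the ordering of Theorem 1 matches the exhaustive optimum, while ordering the receivers like the senders leaves an idle period on the outgoing port and finishes later.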

Proof. To prove that the scheme described by Theorem 1 returns an optimal schedule, we take 
a schedule $S'$ computed by this scheme. Then we take any other schedule $S$. We are going to 
transform $S$ in two steps into our schedule $S'$ and prove that the makespans of the two schedules 
satisfy the following inequality: $M(S') \le M(S)$. 

In the first step we take a look at the senders. The sending from the master cannot start 
before tasks are available on the master. We do not know the ordering of the senders in $S$ but 
we know the ordering in $S'$: all senders are ordered in non-decreasing order of their $c_i$-values. 
Let $i_0$ be the first task sent in $S$ where the sender of task $i_0$ has a bigger $c_i$-value than 
the sender of the $(i_0 + 1)$-th task. We then exchange the senders of task $i_0$ and task 
$(i_0 + 1)$ and call this new schedule $S_{new}$. Obviously the reception time for the second task 
is still the same. But as 
you can see in Figure 5, the time when the first task is available on the master has changed: after 
the exchange, the first task is available earlier and is thus ready for reception earlier. Hence this 
exchange improves the availability on the master (and reduces possible idle times for the receivers). 
We use this mechanism to transform the sending order of $S$ into the sending order of $S'$ and at 
each step the availability on the master is improved. Hence at the end of the transformation the 
makespan of $S_{new}$ is smaller than or equal to that of $S$ and the sending order of $S_{new}$ 
and $S'$ is the same. 



Figure 5: Exchange of the sending order makes tasks available earlier on the master. 

In the second step of the transformation we take care of the receivers (cf. Figures 6 and 7). 
Having already changed the sending order of $S$ by the first transformation of $S$ into $S_{new}$, 
we start here directly with the transformation of $S_{new}$. Using the same mechanism as for the 
senders, we call $j_0$ the first task such that the receiver of task $j_0$ has a smaller 
$c_i$-value than the receiver of task $j_0 + 1$. We exchange the receivers of the tasks $j_0$ and 
$j_0 + 1$ and call the new schedule $S_{new}^{(1)}$. Task $j_0$ is sent at the same time as 
previously, and the processor receiving it receives it earlier than it received task $j_0 + 1$ in 
$S_{new}$. Task $j_0 + 1$ is sent as soon as it is available on the master and as soon as the 
communication of task $j_0$ is completed. The first of these two conditions had also to be satisfied 
by $S_{new}$. If the second condition is delaying the beginning of the sending of the task 
$j_0 + 1$ from the master, then this communication ends at time 
$t_{in} + c_{\pi'(j_0)} + c_{\pi'(j_0+1)} = t_{in} + c_{\pi(j_0+1)} + c_{\pi(j_0)}$, 
that is, at the same time as under the schedule $S_{new}$ (here $\pi(j_0)$ ($\pi'(j_0)$) 
denotes the receiver of task $j_0$ in schedule $S_{new}$ ($S_{new}^{(1)}$, respectively)). Hence the 
finish time of the communication of task $j_0 + 1$ in schedule $S_{new}^{(1)}$ is less than or equal 
to the finish time in the previous schedule. In all cases, $M(S_{new}^{(1)}) \le M(S_{new})$. Note 
that this transformation does not change anything for the tasks received after $j_0 + 1$ except 
that we always perform the scheduled communications as soon as possible. Repeating the 
transformation for the rest of the schedule $S_{new}$ we reduce all idle times in the receptions as 
far as possible. We get for the makespan of each schedule $S_{new}^{(k)}$: 
$M(S_{new}^{(k)}) \le M(S_{new}) \le M(S)$. As after these (finite number of) 
transformations the order of the receivers will be in non-increasing order of the $c_i$-values, the 
receiver order of $S_{new}^{(\infty)}$ is the same as the receiver order of $S'$ and hence we have 
$S_{new}^{(\infty)} = S'$. 
Finally we conclude that the makespan of $S'$ is smaller than or equal to that of any other 
schedule $S$ and hence $S'$ is optimal. 



Figure 6: Exchange of the receiving order suits better with the available tasks on the master. 



□ 



3.2.2 NP-completeness of the original problem 

Figure 7: Deletion of idle time due to the exchange of the receiving order. 

Now we are going to prove the NP-completeness in the strong sense of the general problem. For 
this we were strongly inspired by the proof of Dutot [10, 9] for the Scheduling Problem for 
Master-Slave Tasks on a Tree of Heterogeneous Processors (SPMSTTHP). This proof 
uses a two-level tree as platform topology and we are able to adapt this structure to our 
star-platform. We first recall the 3-partition problem, which is NP-complete in the strong sense 
[12]. 

Definition 2 (3-Partition). 

Let $S$ and $n$ be two integers, and let $(y_i)_{i \in 1..3n}$ be a sequence of $3n$ integers such 
that for each $y_i$, $\frac{S}{4} < y_i < \frac{S}{2}$. 

The question of the 3-partition problem is "Can we partition the set of the $y_i$ in $n$ triples such 
that the sum of each triple is exactly $S$?". 
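
For intuition, the question asked by 3-partition can be answered on tiny instances by exhaustive search, as in the Python sketch below. The function name `three_partition` and the sample values are illustrative assumptions. The search is exponential in general, which is consistent with the problem being NP-complete in the strong sense.

```python
from itertools import combinations


def three_partition(values, S):
    """Return a list of triples that each sum to S, or None if no such partition exists."""
    values = sorted(values)
    if not values:
        return []
    first, rest = values[0], values[1:]
    # The smallest remaining value must belong to some triple: try every pair of partners.
    for a, b in combinations(range(len(rest)), 2):
        if first + rest[a] + rest[b] == S:
            remaining = [v for k, v in enumerate(rest) if k not in (a, b)]
            sub = three_partition(remaining, S)
            if sub is not None:
                return [(first, rest[a], rest[b])] + sub
    return None


# Made-up instance with n = 2 and S = 20; every value lies strictly between S/4 and S/2.
solution = three_partition([6, 7, 7, 6, 6, 8], 20)
```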

Theorem 2. SPMSTSHP is NP-complete in the strong sense. 

Proof. We take an instance of 3-partition. We define some real numbers $x_i$, $1 \le i \le 3n$, by 
$x_i = \frac{S}{4} + \frac{y_i}{8}$. If a triple of $y_i$ has the sum $S$, the corresponding triple 
of $x_i$ has the sum $\frac{7S}{8}$ and vice versa. A partition of the $y_i$ in triples is thus 
equivalent to a partition of the $x_i$ in triples of sum $\frac{7S}{8}$. This modification allows us 
to guarantee that the $x_i$ are contained in a smaller interval than the interval of the $y_i$. 
Indeed, the $x_i$ lie strictly between $\frac{9S}{32}$ and $\frac{5S}{16}$. 
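
The arithmetic behind this change of variables can be spelled out as follows, taking $x_i = \frac{S}{4} + \frac{y_i}{8}$ and $\frac{S}{4} < y_i < \frac{S}{2}$; the indices $a$, $b$, $c$ are introduced here only to name the three elements of a triple.

```latex
\begin{align*}
x_a + x_b + x_c &= \frac{3S}{4} + \frac{y_a + y_b + y_c}{8},
  \quad\text{so}\quad
  x_a + x_b + x_c = \frac{7S}{8} \iff y_a + y_b + y_c = S, \\
\frac{S}{4} < y_i < \frac{S}{2}
  &\implies
  \frac{S}{4} + \frac{S}{32} < x_i < \frac{S}{4} + \frac{S}{16}
  \iff
  \frac{9S}{32} < x_i < \frac{5S}{16}.
\end{align*}
```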



Reduction. For our reduction we use the star-network shown in Figure 8. We consider the 
following instance of SPMSTSHP: worker $P$ owns $4n$ tasks, the other $4n$ workers do not hold 
any task. We work with the deadline $T = E + nS + \frac{S}{4}$, where $E$ is an enormous time fixed 
to $E = (n + 1)S$. The communication link between $P$ and the master has a $c$-value of 
$\frac{S}{4}$. So it can send a task every $\frac{S}{4}$ time units. Its computation time is 
$T + 1$, so worker $P$ has to distribute all its 
tasks as it cannot finish processing a single task by the deadline. Each of the other workers is 
able to process one single task, as its computation time is at least $E$ and we have $2E > T$, which 
makes it impossible to process a second task by the deadline. 



Figure 8: Star platform used in the reduction. 

This structure of the star-network is particularly constructed to reproduce the 3-partition 
problem in the scope of a scheduling problem. We are going to use the bidirectional 1-port 
constraint to create our triplets. 

Creation of a schedule out of a solution to 3-partition. First we show how to construct 
a valid schedule of $4n$ tasks in time $\frac{S}{4} + nS + E$ out of a 3-partition solution. To ease 
reading, the processors $P_i$ are ordered by their $x_i$-values in the order that corresponds to the 
solution of 3-partition. So, without loss of generality, we assume that for each $j \in [0, n - 1]$, 
$x_{3j+1} + x_{3j+2} + x_{3j+3} = \frac{7S}{8}$. The schedule is of the following form: 

1. Worker $P$ sends its tasks as soon as possible to the master, i.e., every $\frac{S}{4}$ time 
units. So it is guaranteed that the $4n$ tasks are sent in $nS$ time units. 

2. The master sends the tasks as soon as possible in incoming order to the workers. The receiver 
order is the following (for all $j \in [0, n - 1]$): 

• Task $4j + 1$, over link of cost $x_{3j+1}$, to processor $P_{3j+1}$. 

• Task $4j + 2$, over link of cost $x_{3j+2}$, to processor $P_{3j+2}$. 

• Task $4j + 3$, over link of cost $x_{3j+3}$, to processor $P_{3j+3}$. 

• Task $4j + 4$, over link of cost $\frac{S}{8}$, to processor $Q_{n-1-j}$. 

The distribution of the four tasks, $4j + 1$, $4j + 2$, $4j + 3$, $4j + 4$, takes exactly $S$ time 
units and the master needs also $S$ time units to receive four tasks from processor $P$. 
Furthermore, each $x_i$ is larger than $\frac{S}{4}$. Therefore, after the first task is sent, the 
master always finishes receiving a new task before its outgoing port is available to send it. The 
first task arrives at time $\frac{S}{4}$ at the master, which is responsible for the short idle time 
at the beginning. The last task arrives at its worker at time $\frac{S}{4} + nS$ and hence there 
remain exactly $E$ time units for the processing of this task. For 
the workers $P_i$, $1 \le i \le 3n$, we know that they can finish processing their tasks in time as 
they all have a computation power of $E$. The computation power of the workers $Q_i$, 
$0 \le i \le n - 1$, is $E + i \times S$ and as they receive their task at time 
$\frac{S}{4} + (n - i - 1) \times S + S$, they have exactly the time to finish their task. 
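
The construction can be replayed numerically. The Python sketch below is an illustration under assumptions: the function name is hypothetical, the triples are a made-up 3-partition solution, and exact rational arithmetic is used so that the comparison with the deadline is not blurred by rounding. It builds the schedule described above and returns its makespan together with the deadline $T = E + nS + \frac{S}{4}$.

```python
from fractions import Fraction


def schedule_from_triples(triples, S):
    """Replay the schedule built from a 3-partition solution; return (makespan, T)."""
    n = len(triples)
    quarter = Fraction(S, 4)            # link cost of worker P: one task every S/4
    E = (n + 1) * S                     # computation time of every worker P_i
    T = E + n * S + quarter             # deadline of the reduction
    out_free = Fraction(0)              # date at which the master's outgoing port is free
    task = 0
    makespan = Fraction(0)
    for j, triple in enumerate(triples):
        assert sum(triple) == S, "each triple must sum to S"
        # Three tasks go to workers P_i over links of cost x_i = S/4 + y_i/8 ...
        block = [(quarter + Fraction(y, 8), E) for y in triple]
        # ... and the fourth goes to Q_{n-1-j}: link cost S/8, computation time E + (n-1-j)*S.
        block.append((Fraction(S, 8), E + (n - 1 - j) * S))
        for link_cost, compute in block:
            task += 1
            arrival = task * quarter            # task reaches the master
            start = max(arrival, out_free)      # one-port constraint on the master
            out_free = start + link_cost
            makespan = max(makespan, out_free + compute)
    return makespan, T


# Made-up instance: n = 2, S = 20, triples (6, 7, 7) and (6, 6, 8).
makespan, deadline = schedule_from_triples([(6, 7, 7), (6, 6, 8)], 20)
```

On this instance the workers $Q_i$ finish exactly at the deadline, which illustrates why the schedule leaves no slack.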

Getting a solution for 3-partition out of a schedule. Now we prove that each schedule of 
$4n$ tasks in time $T$ creates a solution to the 3-partition problem. 

As already mentioned, each worker besides worker $P$ can process at most one task. Hence due 
to the number of tasks in the system, every worker has to process exactly one task. Furthermore 
the minimal time needed to distribute all tasks from the master and the minimal processing time 
on the workers imply that there is no idle time in the emissions of the master, otherwise the 
schedule would take longer than time $T$. 

We also know that worker $P$ is the only sending worker: 

Lemma 1. No worker besides worker $P$ sends any task. 

Proof. Due to the platform configuration and the total number of tasks, worker $P$ has to send 
all its tasks. This takes at least $nS$ time units. The total emission time for the master is also 
$nS$ time units: as each worker must process a task, each of them must receive one. So the emission 
time for the master is larger than or equal to $\sum_{i=1}^{3n} x_i + n \times \frac{S}{8} = nS$. As 
the master cannot start sending the first task before time $\frac{S}{4}$ and as the minimum 
computation power is $E$, then if the 
master sends exactly one task to each slave, the makespan is greater than or equal to $T$ and if one 
worker besides $P$ sends a task, the master will at least send one additional task and the makespan 
will be strictly greater than $T$. □ 

Now we are going to examine the worker $Q_{n-1}$ and the task it is associated with. 
Lemma 2. The task associated with worker $Q_{n-1}$ is one of the first four tasks sent by worker $P$. 
Proof. The computation time of worker $Q_{n-1}$ is $E + (n - 1)S$, hence its task has to arrive no 
later than time $S + \frac{S}{4}$. The fifth task arrives at the soonest at time 
$\frac{5S}{4} + \frac{S}{8}$, as worker $P$ has to send five tasks and the shortest communication 
time is $\frac{S}{8}$. The following tasks arrive later than the 
5-th task, so the task for worker $Q_{n-1}$ has to be one of the first four tasks. □ 

Lemma 3. The first three tasks are sent to some worker $P_i$, $1 \le i \le 3n$. 

Proof. As already mentioned, the master has to send without any idle time besides the initial 
one. Hence we have to pay attention that the master always possesses a task to send when it 
finishes sending a task. While the master is sending to a worker $P_i$, worker $P$ has the time to 
send the next task to the master. But, if at least one of the first three tasks is sent to a worker 
$Q_i$, the sending time of the first three tasks is strictly less than 
$\frac{S}{8} + \frac{5S}{16} + \frac{5S}{16} = \frac{3}{4}S$. Hence there is 
necessarily an idle time in the emission of the master. This pause makes the schedule of $4n$ tasks 
in time $T$ infeasible. □ 

A direct conclusion of the two preceding lemmas is that the 4-th task is sent to worker $Q_{n-1}$. 

Lemma 4. The first three tasks sent by worker $P$ have a total communication time of 
$\frac{7S}{8}$ time units. 

Proof. Worker $Q_{n-1}$ has a computation time of $E + (n - 1)S$, so it has to receive its task no 
later than time $\frac{5S}{4}$. This implies that the first three tasks are sent in a time no longer 
than $\frac{7S}{8}$. 

On the other side, the 5-th task arrives at the master no sooner than time $\frac{5S}{4}$. As the 
master has to send without idle time, the emission to worker $Q_{n-1}$ has to persist until this 
date. Necessarily the first three emissions of the master take at minimum a time $\frac{7S}{8}$. □ 

Lemma 5. Scheduling $4n$ tasks in a time $T = \frac{S}{4} + nS + E$ units of time allows us to 
reconstruct a solution of the associated 3-partition instance. 

Proof. In what precedes, we proved that the first three tasks sent by the master create a triple 
whose sum is exactly $\frac{7S}{8}$. Using this property recursively on $j$ for the triple 
$4j + 1$, $4j + 2$ and $4j + 3$, we show that we must send the task $4j + 4$ to the worker 
$Q_{n-1-j}$. With this method we construct a partition of the set of $x_i$ in triples of sum 
$\frac{7S}{8}$. These triples are a solution to the 
associated 3-partition problem. □ 

Having proven that we can create a schedule out of a solution of 3-partition and also that we 
can get a solution for 3-partition out of a schedule, the proof is now complete. 

□ 



3.3 An algorithm for scheduling on homogeneous star platforms: the 
best-balance algorithm 

In this section we present the Best-Balance Algorithm (BBA), an algorithm to schedule on 
homogeneous star platforms. As already mentioned, we use a bus network with communication 
speed $c$, but additionally we suppose that the computation powers are homogeneous as well. So 
we have $w_i = w$ for all $i$, $1 \le i \le m$. 

The idea of BBA is simple: in each iteration, we check whether redistributing one more task 
would let us finish earlier. If so, we schedule the task; if not, we stop redistributing. The algorithm 
runs in polynomial time. It is a natural intuition that BBA is optimal on homogeneous platforms, 
but the formal proof is rather complicated, as can be seen in Section 3.3.2. 

3.3.1 Notations used in BBA 

BBA schedules one task per iteration i. Let $L_k^{(i)}$ denote the number of tasks of worker k after 
iteration i, i.e., after i tasks were redistributed. The date at which the master has finished receiving 
the i-th task is denoted by $master\_in^{(i)}$. In the same way we call $master\_out^{(i)}$ the date at which 
the master has finished sending the i-th task. Let $end_k^{(i)}$ be the date at which worker k would finish 
processing the load it would hold if exactly i tasks are redistributed. The worker k with the biggest 
finish time $end_k^{(i)}$ in iteration i, who is chosen to send one task in the next iteration, is called 
sender. We call receiver the worker k with smallest finish time $end_k^{(i)}$ in iteration i who is chosen 
to receive one task in the next iteration. 

In iteration i = 0 we are in the initial configuration: all workers own their initial tasks, 
$L_k^{(0)} = L_k$, and the makespan of each worker k is the time it needs to compute all its tasks: 
$end_k^{(0)} = L_k^{(0)} \times w$. Moreover, $master\_in^{(0)} = master\_out^{(0)} = 0$. 

3.3.2 The Best Balance Algorithm - BBA 

We first sketch BBA: 
In each iteration i do: 

• Compute the time $end_k^{(i-1)}$ it would take worker k to process $L_k^{(i-1)}$ tasks. 

• A worker with the biggest finish time $end_k^{(i-1)}$ is arbitrarily chosen as sender; it is called 
sender. 

• Compute the temporary finish times $\widetilde{end}_k^{(i)}$ of each worker if it would receive from sender 
the i-th task. 

• A worker with the smallest temporary finish time $\widetilde{end}_k^{(i)}$ will be the receiver, called receiver. 
If there are multiple workers with the same temporary finish time $\widetilde{end}_k^{(i)}$, we take the worker 
with the smallest finish time $end_k^{(i-1)}$. 

• If the finish time of sender is strictly larger than the temporary finish time $\widetilde{end}_{receiver}^{(i)}$ of 
receiver, sender sends one task to receiver and we iterate. Otherwise we stop. 
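
The sketch above can be turned into a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `best_balance`, its argument names and its return value are hypothetical, the platform is assumed homogeneous (one task costs `w` to compute and `c` to communicate), and the guard on an empty sender is a defensive addition that does not appear in the sketch.

```python
def best_balance(loads, w, c):
    """Sketch of BBA for a homogeneous star platform (assumed interface).

    loads: initial number of tasks L_k of each worker
    w:     time to process one task (same on every worker)
    c:     time to communicate one task (same on every link)
    Returns (final loads, list of (sender, receiver) transfers, makespan).
    """
    loads = list(loads)
    end = [n * w for n in loads]   # end_k^(0) = L_k^(0) * w
    master_in = 0                  # date the master finished receiving the last task
    master_out = 0                 # date the master finished sending the last task
    transfers = []
    workers = range(len(loads))
    while True:
        # a worker with the biggest finish time is the sender
        sender = max(workers, key=lambda k: end[k])
        if loads[sender] == 0:     # defensive guard, not part of the sketch
            break
        next_in = master_in + c
        arrival = max(next_in, master_out) + c
        # temporary finish time of each worker if it received the next task
        tmp = [max(end[k], arrival) + w for k in workers]
        # smallest temporary finish time; ties broken by smallest current finish time
        receiver = min(workers, key=lambda k: (tmp[k], end[k]))
        if end[sender] <= tmp[receiver]:
            break                  # the makespan cannot be improved any more
        master_in, master_out = next_in, arrival
        end[sender] -= w
        loads[sender] -= 1
        end[receiver] = tmp[receiver]
        loads[receiver] += 1
        transfers.append((sender, receiver))
    return loads, transfers, max(end)
```

For instance, with two workers holding 3 and 0 tasks, `w = 10` and `c = 1`, the first task reaches the idle worker at time 2 and a second transfer would not help, so the sketch stops after one redistribution.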

Lemma 6. On homogeneous star-platforms, in iteration i the Best-Balance Algorithm (Al- 
gorithm 1) always chooses as receiver a worker which finishes processing the first in iteration 
i - 1. 

Proof. As the platform is homogeneous, all communications take the same time and all 
computations take the same time. In Algorithm 1 the master chooses as receiver in iteration i the worker 
k that would end the earliest the processing of the i-th task sent. To prove that worker k is also 
the worker which finishes processing first in iteration i - 1, we have to consider two cases: 

• Task i arrives when all workers are still working. 

As all workers are still working when the master finishes sending task i, the master chooses 
as receiver a worker which finishes processing first, because this worker will also finish 
processing task i first, as we have homogeneous conditions. See Figure 9(a) for an example: 
the master chooses worker k, as in iteration i - 1 it finishes before worker j and it can thus 
start computing task i + 1 earlier than worker j could do. 

• Task i arrives when some workers have finished working. 

If some workers have finished working when the master can finish sending task i, we are 
in the situation of Figure 9(b): all these workers could start processing task i at the same 
time. As our algorithm chooses in this case a worker which finished processing first (see line 
13 in Algorithm 1), the master chooses worker k in the example. □ 

Figure 9: In iteration i, the master chooses which worker will be the receiver of task i. Panels: (a) all workers are still processing; (b) some workers have already finished processing. 



The aim of these schedules is always to minimize the makespan. So workers that take a long 
time to process their tasks are interested in sending some tasks to less loaded workers 
in order to decrease their processing time. If a lightly loaded worker sends some tasks 
to another worker, this will not decrease the global makespan, as a heavily loaded worker still 
has its long processing time, or its processing time might even have increased if it was the receiver. 
So it might happen that the lightly loaded worker that sent a task will receive another task in 
another scheduling step. In the following lemma we show that this kind of schedule, where 
sending workers also receive tasks, can be transformed into a schedule where this effect does not 
appear. 

Lemma 7. On a platform with homogeneous communications, if there exists a schedule S with 
makespan M, then there also exists a schedule S' with a makespan $M' \leq M$ such that no worker 
both sends and receives tasks. 

Proof. We will prove that we can transform a schedule where senders might receive tasks into a 
schedule with equal or smaller makespan where senders do not receive any tasks. 



Figure 10: Scheme on how to break up sending chains. 

If the master receives its i-th task from processor $P_j$ and sends it to processor $P_k$, we say that 
$P_k$ receives this task from processor $P_j$. 

Whatever the schedule, if a sender receives a task we have the situation of a sending chain (see 
Figure 10): at some step of the schedule a sender $s_i$ sends to a sender $s_k$, while in another step of 
the schedule the sender $s_k$ sends to a receiver $r_j$. So the master is occupied twice. As all receivers 
in fact receive their tasks from the master, it does not make a difference for them which sender 
sent the task to the master. So we can break up the sending chain in the following way: we look 
for the earliest time when a sending worker, $s_k$, receives a task from a sender, $s_i$. Let $r_j$ be a 
receiver that receives a task from sender $s_k$. There are two possible situations: 

1. Sender $s_i$ sends to sender $s_k$ and later sender $s_k$ sends to receiver $r_j$, see Figure 11(a). This 
case is simple: as the communication from $s_i$ to $s_k$ takes place first and we have homogeneous 
communication links, we can replace this communication by an emission from sender $s_i$ to 
receiver $r_j$ and just delete the second communication. 

2. Sender $s_k$ sends to receiver $r_j$ and later sender $s_i$ sends to sender $s_k$, see Figure 11(b). In this 
case the reception on receiver $r_j$ happens earlier than the emission of sender $s_i$, so we can 
not use exactly the same mechanism as in the previous case. But we can use our hypothesis 
that sender $s_k$ is the first sender that receives a task. Therefore, sender $s_i$ did not receive any 
task until $s_k$ receives. So at the moment when $s_k$ sends to $r_j$, we know that sender $s_i$ already 
owns the task that it will send later to sender $s_k$. As we use homogeneous communications, 
we can schedule the communication $s_i \to r_j$ when the communication $s_k \to r_j$ originally 
took place and delete the sending from $s_i$ to $s_k$. 

As in both cases we gain in communication time, but we keep the same computation time, we 
do not increase the makespan of the schedule, and we have transformed it into a schedule with one 
sending chain less. By repeating this procedure for all sending chains, we transform the schedule S into 
a schedule S' without sending chains while not increasing the makespan. □ 
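
The transformation used in this proof can be illustrated on the list of communications alone. The following Python sketch is a simplification under loud assumptions: the name `break_sending_chains` and the representation of a schedule as a time-ordered list of hypothetical `(source, destination)` pairs (one pair per step of the master) are not from the paper, timing and makespan are not modelled, and the input is assumed to contain no self-transfer. In both cases of the proof the earlier of the two communications becomes $s_i \to r_j$ and the later one is deleted.

```python
def break_sending_chains(transfers):
    """Remove sending chains from a time-ordered list of (source, destination) pairs.

    The result contains no worker that both sends and receives, and every
    worker keeps the same net change of load as in the input.
    """
    transfers = list(transfers)
    while True:
        senders = {src for src, _ in transfers}
        # earliest step at which a sending worker s_k receives a task from s_i
        a = next((n for n, (_, dst) in enumerate(transfers) if dst in senders), None)
        if a is None:
            return transfers               # no sending chain is left
        s_i, s_k = transfers[a]
        # a step at which s_k sends a task to some worker r_j
        b = next(n for n, (src, _) in enumerate(transfers) if src == s_k)
        r_j = transfers[b][1]
        first, second = min(a, b), max(a, b)
        del transfers[second]              # the later communication disappears
        if s_i == r_j:
            del transfers[first]           # the task would return to its origin
        else:
            transfers[first] = (s_i, r_j)  # s_i -> r_j takes the earlier slot
```

Each pass removes at least one communication, so the loop terminates after at most as many passes as there are communications.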



Figure 11: How to break up sending chains, dark colored communications are emissions, light 
colored communications represent receptions. 

Proposition 1. Best-Balance Algorithm (Algorithm 1) calculates an optimal schedule S on 
a homogeneous star network, where all tasks are initially located on the workers and communication 
capabilities as well as computation capabilities are homogeneous and all tasks have the same size. 

Proof. To prove that BBA is optimal, we take a schedule $S_{algo}$ calculated by Algorithm 1. Then 
we take an optimal schedule $S_{opt}$. (Because of Lemma 7 we can assume that in the schedule $S_{opt}$ 
no worker both sends and receives tasks.) We are going to transform by induction this optimal 
schedule into our schedule $S_{algo}$. 

As we use a homogeneous platform, all workers have the same communication time c. Without 
loss of generality, we can assume that both algorithms do all communications as soon as possible 
(see Figure 12). So we can divide our schedule $S_{algo}$ in $s_a$ steps and $S_{opt}$ in $s_o$ steps. A step 
corresponds to the emission of one task, and we number in this order the tasks sent. Accordingly 
the s-th task is the task sent during step s and the actual schedule corresponds to the load 
distribution after the s first tasks. We start our schedule at time T = 0. 

Let S(i) denote the worker receiving the i-th task under schedule S. Let $i_0$ be the first step 
where $S_{opt}$ differs from $S_{algo}$, i.e., $S_{algo}(i_0) \neq S_{opt}(i_0)$ and $\forall i < i_0$, $S_{algo}(i) = S_{opt}(i)$. We look for 
a step $j > i_0$, if it exists, such that $S_{opt}(j) = S_{algo}(i_0)$ and j is minimal. 

We are in the following situation: schedule $S_{opt}$ and schedule $S_{algo}$ are the same for all tasks 
$[1..(i_0 - 1)]$. As worker $S_{algo}(i_0)$ is chosen at step $i_0$, then, by definition of Algorithm 1, this 
means that this worker finishes first its processing after the reception of the $(i_0 - 1)$-th task (cf. 




Figure 11 panels: (a) sender $s_i$ sends to receiving sender $s_k$, then sender $s_k$ sends to receiver $r_j$; (b) sender $s_k$ sends first to receiver $r_j$ and then receives from sender $s_i$. 

Algorithm 1 Best-Balance Algorithm 

/* initialization */ 
$i \leftarrow 0$ 
$master\_in^{(i)} \leftarrow 0$ 
$master\_out^{(i)} \leftarrow 0$ 
$\forall k$: $end_k^{(0)} \leftarrow L_k^{(0)} \times w$ 
/* the scheduling */ 
while true do 
    $sender \leftarrow \arg\max_k end_k^{(i)}$ 
    $master\_in^{(i+1)} \leftarrow master\_in^{(i)} + c$ 
    $task\_arrival\_worker = \max(master\_in^{(i+1)}, master\_out^{(i)}) + c$ 
    $\forall k$: $\widetilde{end}_k^{(i+1)} \leftarrow \max(end_k^{(i)}, task\_arrival\_worker) + w$ 
    select receiver such that $\widetilde{end}_{receiver}^{(i+1)} = \min_k \widetilde{end}_k^{(i+1)}$ and, if there are several processors with 
        the same minimum $\widetilde{end}_k^{(i+1)}$, choose one with the smallest $end_k^{(i)}$ 
    if $end_{sender}^{(i)} \leq \widetilde{end}_{receiver}^{(i+1)}$ then 
        /* we can not improve the makespan anymore */ 
        break 
    else 
        /* we improve the makespan by sending the task to the receiver */ 
        $master\_out^{(i+1)} \leftarrow task\_arrival\_worker$ 
        $end_{sender}^{(i+1)} \leftarrow end_{sender}^{(i)} - w$ 
        $L_{sender}^{(i+1)} \leftarrow L_{sender}^{(i)} - 1$ 
        $end_{receiver}^{(i+1)} \leftarrow \widetilde{end}_{receiver}^{(i+1)}$ 
        $L_{receiver}^{(i+1)} \leftarrow L_{receiver}^{(i)} + 1$ 
        for all $j \neq receiver$ and $j \neq sender$ do 
            $end_j^{(i+1)} \leftarrow end_j^{(i)}$ 
            $L_j^{(i+1)} \leftarrow L_j^{(i)}$ 
        end for 
        $i \leftarrow i + 1$ 
    end if 
end while 

Figure 12: Occupation of the master (timeline of the receptions by the master and of the sendings from the master). 



Lemma 6). As $S_{opt}$ and $S_{algo}$ differ in step $i_0$, we know that $S_{opt}$ chooses worker $S_{opt}(i_0)$ that 
finishes the schedule of its load after step $(i_0 - 1)$ no sooner than worker $S_{algo}(i_0)$. 

Case 1: Let us first consider the case where there exists such a step j. So $S_{algo}(i_0) = S_{opt}(j)$ 
and $j > i_0$. We know that worker $S_{opt}(j)$ under schedule $S_{opt}$ does not receive any task between 
step $i_0$ and step j as j is chosen minimal. 

We use the following notations for the schedule $S_{opt}$, depicted on Figures 13, 14, and 15: 

$T_j$: the date at which the reception of task j is finished on worker $S_{opt}(j)$, i.e., $T_j = j \times c + c$ (the 
time it takes the master to receive the first task plus the time it takes him to send j tasks). 

$T_{i_0}$: the date at which the reception of task $i_0$ is finished on worker $S_{opt}(i_0)$, i.e., $T_{i_0} = i_0 \times c + c$. 

$F_{pred(j)}$: time when computation of task pred(j) is finished, where task pred(j) denotes the last 
task which is computed on worker $S_{opt}(j)$ before task j is computed. 

$F_{pred(i_0)}$: time when computation of task $pred(i_0)$ is finished, where task $pred(i_0)$ denotes the 
last task which is computed on worker $S_{opt}(i_0)$ before task $i_0$ is computed. 



We have to consider two sub-cases: 

• $T_j \leq F_{pred(i_0)}$ (Figure 13(a)). 

This means that we are in the following situation: the reception of task j on worker $S_{opt}(j)$ 
has already finished when worker $S_{opt}(i_0)$ finishes the work it has been scheduled until step 
$i_0 - 1$. 

In this case we exchange the tasks $i_0$ and j of schedule $S_{opt}$ and we create the following 
schedule $S'_{opt}$: 

$S'_{opt}(i_0) = S_{opt}(j) = S_{algo}(i_0)$, 

$S'_{opt}(j) = S_{opt}(i_0)$ 

and $\forall i \neq i_0, j$: $S'_{opt}(i) = S_{opt}(i)$. The schedule of the other workers is kept unchanged. All 
tasks are executed at the same date as previously (but maybe not on the same processor). 



Figure 13: Schedule $S_{opt}$ before and after exchange of tasks $i_0$ and j. Panels: (a) before the exchange; (b) after the exchange. 

Now we prove that this kind of exchange is possible. 

We know that worker $S_{opt}(j)$ is not scheduled any task later than step $i_0 - 1$ and before 
step j, by definition of j. So we know that this worker can start processing task j when 
task j has arrived and when it has finished processing its amount of work scheduled until 
step $i_0 - 1$. We already know that worker $S_{opt}(j) = S_{algo}(i_0)$ finishes processing its tasks 
scheduled until step $i_0 - 1$ at a time earlier than or equal to that of worker $S_{opt}(i_0)$ (cf. 
Lemma 6). As we are in homogeneous conditions, communications and processing of a task 
take the same time on all processors. So we can exchange the destinations of steps $i_0$ and 
j and keep the same moments of execution, as both tasks will arrive in time to be processed 
on the other worker: task $i_0$ will arrive at worker $S_{opt}(j)$ when it is still processing and the 
same for task j on worker $S_{opt}(i_0)$. Hence task $i_0$ will be sent to worker $S_{opt}(j) = S_{algo}(i_0)$ 
and worker $S_{opt}(i_0)$ will receive task j. So schedule $S_{opt}$ and schedule $S_{algo}$ are the same for 
all tasks $[1..i_0]$ now. As both tasks arrive in time and can be executed instead of the other 
task, we do not change anything in the makespan M. And as $S_{opt}$ is optimal, we keep the 
optimal makespan. 



• $T_j > F_{pred(i_0)}$ (Figure 14(a)). 

In this case we have the following situation: task j arrives on worker $S_{opt}(j)$ when worker 
$S_{opt}(i_0)$ has already finished processing its tasks scheduled until step $i_0 - 1$. 
In this case we exchange the schedule destinations $i_0$ and j of schedule $S_{opt}$ beginning at 
tasks $i_0$ and j (see Figure 14). In other words we create a schedule $S'_{opt}$: 

$\forall i \geq i_0$ such that $S_{opt}(i) = S_{opt}(i_0)$: $S'_{opt}(i) = S_{opt}(j) = S_{algo}(i_0)$, 

$\forall i \geq j$ such that $S_{opt}(i) = S_{opt}(j)$: $S'_{opt}(i) = S_{opt}(i_0)$, 

and $\forall i < i_0$: $S'_{opt}(i) = S_{opt}(i)$. The schedule $S_{opt}$ of the other workers is kept unchanged. We 
recompute the finish times $F^{(s)}$ of workers $S_{opt}(j)$ and $S_{opt}(i_0)$ for all steps $s > i_0$. 



Figure 14: Schedule $S_{opt}$ before and after exchange of lines $i_0$ and j. Panels: (a) before exchange; (b) after exchange. 

Now we prove that this kind of exchange is possible. First of all we know that worker $S_{algo}(i_0)$ 
is the same as the worker chosen in step j under schedule $S_{opt}$ and so $S_{algo}(i_0) = S_{opt}(j)$. 
We also know that worker $S_{opt}(j)$ is not scheduled any tasks later than step $i_0 - 1$ and before 
step j, by definition of j. Because of the choice of worker $S_{algo}(i_0) = S_{opt}(j)$ in $S_{algo}$, we 
know that worker $S_{opt}(j)$ has finished working when task j arrives: at step $i_0$ worker $S_{opt}(j)$ 
finishes earlier than or at the same time as worker $S_{opt}(i_0)$ (Lemma 6) and as we are in the 
case where $T_j > F_{pred(i_0)}$, $S_{opt}(j)$ has also finished when j arrives. So we can exchange the 
destinations of the workers $S_{opt}(i_0)$ and $S_{opt}(j)$ in the schedule steps equal to, or later than, 
step $i_0$ and process them at the same time as we would do on the other worker. As we have 
shown that we can start processing task j on worker $S_{opt}(i_0)$ at the same time as we did 
on worker $S_{opt}(j)$, and the same for task $i_0$, we keep the same makespan. And as $S_{opt}$ is 
optimal, we keep the optimal makespan. 

Case 2: If there does not exist a j, i.e., we can not find a schedule step $j > i_0$ such that worker 
$S_{algo}(i_0)$ is scheduled a task under schedule $S_{opt}$, then we know that no other task will be scheduled 
on worker $S_{algo}(i_0)$ under the schedule $S_{opt}$. As our algorithm chooses in step s the worker that 
finishes task s + 1 first, we know that worker $S_{algo}(i_0)$ finishes at a time earlier than or equal to that 
of $S_{opt}$. Worker $S_{algo}(i_0)$ will be idle in the schedule $S_{opt}$ for the rest of the algorithm, because 
otherwise we would have found a step j. As we are in homogeneous conditions, we can simply displace 
task $i_0$ from worker $S_{opt}(i_0)$ to worker $S_{algo}(i_0)$ (see Figure 15). As $S_{opt}(i_0) \neq S_{algo}(i_0)$, 
and as Lemma 6 tells us that worker $S_{algo}(i_0)$ finishes processing its tasks until step $i_0 - 1$ at 
a time earlier than or equal to $S_{opt}(i_0)$, we do not downgrade the execution time, because we 
are in homogeneous conditions. 

Figure 15: Schedule $S_{opt}$ before and after displacing task $i_0$. Panels: (a) before displacing; (b) after displacing. 

Once we have done the exchange of task $i_0$, the schedules $S_{opt}$ and $S_{algo}$ are the same for all 
tasks $[1..i_0]$. We restart the transformation until $S_{opt} = S_{algo}$ for all tasks $[1..\min(s_a, s_o)]$ 
scheduled by $S_{algo}$. 

Now we will prove by contradiction that the number of tasks scheduled by $S_{algo}$, $s_a$, and by $S_{opt}$, 
$s_o$, are the same. After $\min(s_a, s_o)$ transformation steps $S_{opt} = S_{algo}$ for all tasks $[1..\min(s_a, s_o)]$ 
scheduled by $S_{algo}$. So if after these steps $S_{opt} = S_{algo}$ for all n tasks, both algorithms redistributed 
the same number of tasks and we have finished. 

We now consider the case $s_a \neq s_o$. In the case of $s_a > s_o$, $S_{algo}$ schedules more tasks than $S_{opt}$. 
At each step of our algorithm we do not increase the makespan. So if we do more steps than $S_{opt}$, 
this means that we scheduled some tasks without changing the global makespan. Hence $S_{algo}$ is 
optimal. 

If $s_a < s_o$, this means that $S_{opt}$ schedules more tasks than $S_{algo}$ does. In this case, after $s_a$ 
transformation steps, $S_{opt}$ still schedules tasks. If we take a look at the schedule of the $(s_a + 1)$-th 
task in $S_{opt}$: regardless which receiver $S_{opt}$ chooses, it will increase the makespan, as we prove 
now. In the following we will call $s_{algo}$ the worker our algorithm would have chosen to be the 
sender, $r_{algo}$ the worker our algorithm would have chosen to be the receiver. $s_{opt}$ and $r_{opt}$ are 
the sender and receiver chosen by the optimal schedule. Indeed, in our algorithm we would have 
chosen $s_{algo}$ as sender such that it is a worker which finishes last. So the time worker $s_{algo}$ finishes 
processing is $F_{s_{algo}} = M(S_{algo})$. $S_{algo}$ chooses the receiver $r_{algo}$ such that it finishes processing 
the received task the earliest of all possible receivers and such that it also finishes processing the 
received task at the same time or earlier than the sender would do. As $S_{algo}$ did not decide to 
send the $(s_a + 1)$-th task, this means that it could not find a receiver which fitted. Hence we know, 
regardless which receiver $S_{opt}$ chooses, that the makespan will strictly increase (as $S_{algo} = S_{opt}$ for 
all $[1..s_a]$). We take a look at the makespan of $S_{algo}$ if we would have scheduled the $(s_a + 1)$-th task. 
We know that we can not decrease the makespan anymore, because in our algorithm we decided 
to keep the schedule unchanged. So after the emission of the $(s_a + 1)$-th task, the makespan would 
become $M(S_{algo}) = F_{r_{algo}} \geq F_{s_{algo}}$. And $F_{r_{algo}} \leq F_{r_{opt}}$, because of the definition of receiver $r_{algo}$. 
As $M(S_{opt}) \geq F_{r_{opt}}$, we have $M(S_{algo}) \leq M(S_{opt})$. But we decided not to do this schedule as 
$M(S_{algo})$ is smaller before the schedule of the $(s_a + 1)$-th task than afterwards. Hence we get 
that $M(S_{algo}) < M(S_{opt})$. So the only possibility why $S_{opt}$ sends the $(s_a + 1)$-th task and still 
is optimal is that, later on, $r_{opt}$ sends a task to some other processor $r_k$. (Note that even if we 
choose $S_{opt}$ to have no such chains in the beginning, some might have appeared because of our 
previous transformations.) In the same manner as we transformed sending chains in Lemma 7, 
we can suppress this sending chain, by sending task $(s_a + 1)$ directly to $r_k$ instead of sending to 
$r_{opt}$. With the same argumentation, we do this by induction for all tasks k, $(s_a + 1) \leq k \leq s_o$, 
until schedule $S_{opt}$ and $S_{algo}$ have the same number $s_o = s_a$ and so $S_{opt} = S_{algo}$ and hence 
$M(S_{opt}) = M(S_{algo})$. □ 

Complexity: The initialization phase is in O(m), as we have to compute the finish times for 
each worker. The while loop can be run at most n times, as we can not redistribute more 
than the n tasks of the system. Each iteration is in the order of O(m), which leads us to a total 
run time of O(m × n). 

3.4 Scheduling on platforms with homogeneous communication links 
and heterogeneous computation capacities 

In this section we present an algorithm for star-platforms with homogeneous communications and 
heterogeneous workers, the Moore Based Binary-Search Algorithm (MBBSA). For a given 
makespan, we determine whether there exists a schedule that finishes all work in time. If there is one, 
we optimize the makespan by a binary search. The plan of the section is as follows: in Section 3.4.1 
we present an existing algorithm which will be the basis of our work. The framework and some 
useful notations are introduced in Section 3.4.2, whereas the algorithm itself is the subject of 
Section 3.4.3. 
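
The binary search over the makespan can be pictured with the following Python sketch. It is a generic illustration rather than the algorithm of this section: `minimal_makespan`, `feasible`, `lo`, `hi` and `eps` are hypothetical names, `feasible(M)` stands for any test that answers whether a schedule of makespan at most M exists, and the sketch assumes that this test is monotone in M and that `feasible(hi)` holds.

```python
def minimal_makespan(feasible, lo, hi, eps):
    """Binary search for the smallest feasible makespan in [lo, hi].

    feasible: predicate M -> bool, assumed monotone (once true, always true)
    lo, hi:   search interval, with feasible(hi) assumed to be true
    eps:      precision at which the search stops
    Returns a feasible makespan that is within eps of the smallest one.
    """
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid     # a schedule exists: try a smaller makespan
        else:
            lo = mid     # no schedule: the makespan must be larger
    return hi
```

The upper bound `hi` is feasible at every step, so the returned value is always the makespan of an existing schedule.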

3.4.1 Moore's algorithm 

In this section we present Moore's algorithm [6, 18], whose aim is to maximize the number 
of tasks to be processed in time, i.e., before tasks exceed their deadlines. This algorithm gives a 
solution to the $1 || \sum U_j$ problem, in which the maximum number of tasks, among n tasks, has to be 
processed in time on a single machine. Each task k, $1 \leq k \leq n$, has a processing time $w_k$ and a 
deadline $d_k$, before which it has to be processed. 

Moore's algorithm works as follows: all tasks are ordered in non-decreasing order of their 
deadlines. Tasks are added to the solution one by one in this order as long as their deadlines are 
satisfied. If a task k is out of time, the task j in the actual solution with the largest processing 
time $w_j$ is deleted from the solution. 

Algorithm 2 [6, 18] solves in O(n log n) the $1 || \sum U_j$ problem: it constructs a maximal set $\sigma$ of 
early jobs. 

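Algorithm 2 Moore's algorithm 
1: Order the jobs by non-decreasing deadlines: $d_1 \leq d_2 \leq \cdots \leq d_n$ 
2: $\sigma \leftarrow \emptyset$; $t \leftarrow 0$ 
3: for i := 1 to n do 
4:     $\sigma \leftarrow \sigma \cup \{i\}$ 
5:     $t \leftarrow t + w_i$ 
6:     if $t > d_i$ then 
7:         Find job j in $\sigma$ with largest $w_j$ value 
8:         $\sigma \leftarrow \sigma \setminus \{j\}$ 
9:         $t \leftarrow t - w_j$ 
10:     end if 
11: end for 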
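
As an illustration, Algorithm 2 can be written compactly in Python. This is a minimal sketch with assumed names (`moore_hodgson`, and jobs given as hypothetical `(processing_time, deadline)` pairs); a max-heap on the processing times, emulated with negated keys in `heapq`, gives the O(n log n) behaviour mentioned above.

```python
import heapq


def moore_hodgson(jobs):
    """Return the indices of a maximum set of early jobs on one machine.

    jobs: list of (processing_time, deadline) pairs.
    """
    # order the jobs by non-decreasing deadlines
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][1])
    heap = []   # max-heap on processing time, stored as (-w, index)
    t = 0       # total processing time of the jobs currently kept
    for i in order:
        w, d = jobs[i]
        heapq.heappush(heap, (-w, i))
        t += w
        if t > d:
            # deadline missed: drop the kept job with the largest processing time
            neg_w, _ = heapq.heappop(heap)
            t += neg_w
    return sorted(i for _, i in heap)
```

With the jobs `[(4, 4), (3, 5), (2, 6)]`, adding the second job misses its deadline, so the longest job (the first one) is dropped and the two remaining jobs are early.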

3.4.2 Framework and notations for MBBSA 

We keep the star network of Section 3.1 with homogeneous communication links. In contrast 
to Section 3.3 we suppose m heterogeneous workers who initially own a number $L_i$ of identical 
independent tasks. 

Let M denote the objective makespan for the searched schedule $\sigma$ and $f_i$ the time needed by 
worker i to process its initial load. During the algorithm execution we divide all workers in two 
subsets, where S is the set of senders ($s_i \in S$ if $f_{s_i} > M$) and R the set of receivers ($r_i \in R$ 
if $f_{r_i} < M$). As our algorithm is based on Moore's, we need a notation for deadlines. Let $d_{r_i}^{(k)}$ 
be the deadline to receive the k-th task on receiver $r_i$. $l_{s_i}$ denotes the number of tasks sender 
i sends to the master and $l_{r_i}$ stores the number of tasks receiver i is able to receive from the 
master. With help of these values we can determine the total amount of tasks that must be sent 
as $L_{send} = \sum_{s_i} l_{s_i}$. The total amount of tasks if all receivers receive the maximum amount of tasks 
they are able to receive is $L_{recv} = \sum_{r_i} l_{r_i}$. Finally, let $L_{sched}$ be the maximal amount of tasks 
that can be scheduled by the algorithm. 



3.4.3 Moore based binary search algorithm - MBBSA 

Principle of the algorithm: Considering the given makespan we determine overcharged work- 
ers, which can not finish all their tasks within this makespan. These overcharged workers will 
then send some tasks to undercharged workers, such that all of them can finish processing within 
the makespan. The algorithm solves the following two questions: Is there a possible schedule such 
that all workers can finish in the given makespan? In which order do we have to send and receive 
to obtain such a schedule? 



The algorithm can be divided into four phases: 

Phase 1 decides which of the workers will be senders and which receivers, depending on the 
given makespan (see Figure 16). Senders are workers which are not able to process all their 
initial tasks in time, whereas receivers are workers which could treat more tasks in the given 
makespan M than they hold initially. So a sender $P_i$ has a finish time $f_i > M$, i.e., the time 
needed to compute its initial tasks is larger than the given makespan M. Conversely, $P_i$ 
is a receiver if it has a finish time fi < M . So the set of senders in the example of Figure 16 
contains si and s^, and the set of receivers ri, ^2, and r„. 

Figure 16: Initial distribution of the tasks to the workers; dark colored tasks can be computed 
in time, light colored tasks will be late and have to be scheduled on some other workers. 



Phase 2 fixes how many transfers have to be scheduled from each sender such that the senders 
all finish their remaining tasks in time. Sender $s_i$ will have to send an amount of tasks 
$l_{s_i} = \lceil (f_{s_i} - T) / w_{s_i} \rceil$, where T is the given makespan M (i.e., the number of light colored 
tasks of a sender in Figure 16). 

Phase 3 computes for each receiver the deadline of each of the tasks it can receive, i.e., a pair 
$(d_{r_j}^{(i)}, r_j)$ that denotes the i-th deadline of receiver $r_j$. Beginning at the makespan M one 
measures when the last task has to arrive on the receiver such that it can be processed in 
time. So the latest moment that a task can arrive so that it can still be computed on receiver 
$r_j$ is $T - w_{r_j}$, and so on. See Figure 17 for an example. 



Figure 17: Computation of the deadlines $d_{r_j}^{(k)}$ for worker $r_j$. 
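
The deadline computation of Phase 3 can be sketched in Python for a single receiver. The function name `receiver_deadlines` and its parameters are assumptions made for this illustration: `f` is the time the receiver needs for its initial load, `w` its processing time per task and `M` the target makespan. The k-th deadline counted from the end is $M - k \times w$, and deadlines are generated as long as the receiver has finished its initial load by that date.

```python
def receiver_deadlines(f, w, M):
    """Latest arrival dates of the extra tasks one receiver can still process.

    f: time needed to process the initial load of the receiver
    w: processing time of one task on the receiver
    M: target makespan
    Returns the deadlines M - w, M - 2w, ... that are not earlier than f.
    """
    deadlines = []
    count = 0                        # number of extra tasks found so far
    while f <= M - (count + 1) * w:
        count += 1
        deadlines.append(M - count * w)
    return deadlines
```

For a receiver with `f = 4`, `w = 3` and `M = 12`, two extra tasks fit: the last one must arrive by time 9 and the one before by time 6.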



Phase 4 is the proper scheduling step: the master decides which tasks have to be scheduled on 
which receivers and in which order. In this phase we use Moore's algorithm. Starting at 
time t = c (this is the time when the first task arrives at the master), the master can start 
scheduling the tasks on the receivers. For this purpose the deadlines $(d, r_j)$ are ordered by 
non-decreasing d-values. In the same manner as in Moore's algorithm, an optimal schedule 
$\sigma$ is computed by adding tasks to the schedule one by one: if we consider the deadline $(d, r_j)$, 
we add a task to processor $r_j$. The corresponding processing time is the communication time 
c of $r_j$. So if a deadline is not met, the last reception is suppressed from $\sigma$ and we continue. 
If the schedule is able to send at least $L_{send}$ tasks the algorithm succeeds, otherwise it fails. 

Algorithm 3 describes MBBSA in pseudo-code. Note that the algorithm is written for 
heterogeneous conditions, but here we study it for homogeneous communication links. 
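
The four phases can be assembled into a small feasibility test. The Python sketch below is an illustration under assumptions, not a transcription of Algorithm 3: the name `mbbsa_feasible` and its interface are hypothetical, communications are taken to be homogeneous with cost `c`, all values are assumed to be integers (or exact rationals) so that the floor divisions are exact, and the early exit of Phase 2 assumes that a sender cannot emit more than $\lfloor M / c \rfloor$ tasks within the makespan.

```python
def mbbsa_feasible(L, w, c, M):
    """Test whether makespan M is reachable; return (feasible, receptions).

    L: initial number of tasks of each worker
    w: processing time per task of each worker
    c: communication time of one task (homogeneous links)
    M: target makespan
    receptions is a list of (deadline, receiver) pairs in sending order.
    """
    f = [L[i] * w[i] for i in range(len(L))]
    # Phase 1: split the workers into senders and receivers
    senders = [i for i in range(len(L)) if f[i] > M]
    receivers = [i for i in range(len(L)) if f[i] < M]
    # Phase 2: number of tasks each sender must give away
    to_send = 0
    for s in senders:
        need = -((M - f[s]) // w[s])     # ceil((f_s - M) / w_s)
        if M // c < need:                # assumed test: M is too small
            return False, []
        to_send += need
    # Phase 3: deadlines of the receptions each receiver can accept
    deadlines = []
    for r in receivers:
        k = 0
        while f[r] <= M - (k + 1) * w[r]:
            k += 1
            deadlines.append((M - k * w[r], r))
    # Phase 4: Moore-like selection; all receptions cost c, so on a missed
    # deadline the reception just added is the one that is removed
    deadlines.sort()
    t = c                                # the first task reaches the master at c
    kept = []
    for d, r in deadlines:
        kept.append((d, r))
        t += c
        if t > d:
            kept.pop()
            t -= c
    return len(kept) >= to_send, kept[:to_send]
```

With `L = [5, 1, 0]`, `w = [2, 2, 3]` and `c = 1`, a makespan of 6 is reachable (the loaded worker gives away two tasks), whereas a makespan of 5 is not, because three tasks would have to be moved and only two receptions meet their deadlines.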

Theorem 3. MBBSA (Algorithm 3) succeeds in building a schedule $\sigma$ for a given makespan M if 
and only if there exists a schedule with makespan less than or equal to M, when the platform 
is made of one master and several workers with heterogeneous computation power but homogeneous 
communication capabilities. 

Proof. Algorithm 2 (Moore's Algorithm) constructs a maximal set a of early jobs on a single 
machine scheduling problem. So we are going to show that our algorithm can be reduced to this 
problem. 

As we work with a platform with homogeneous communications, we do not have to care about 
the arrival times of jobs at the master, apart from the first job. Our deadlines correspond to the 
latest moments, at which tasks can arrive on the workers such that they can be processed in-time 
(see Figure 17). So we have a certain number Lrecv of possible receptions for all receivers. 

Phases 1 to 3 prepare our scheduling problem to be similar to the situation in Algorithm 2 and 
thus to be able to use it. 

In phase 1 we distinguish which processors have to be senders and which have to be receivers. 
With Lemma 7 we know that we can partition our workers in senders and receivers (and workers 
which are neither), because senders will never receive any tasks. Phase 2 computes the 
number of tasks $L_{send}$ that has to be scheduled. Phase 3 computes the $(d_{r_j}^{(k)}, r_j)$-values, i.e., the 
deadlines $d_{r_j}^{(k)}$ for each receiver $r_j$. Phase 4 is the proper scheduling step and it corresponds to 
Moore's algorithm. It computes a maximal set $\sigma$ of in-time jobs, where $L_{sched}$ is the number of 
scheduled tasks. 

The algorithm returns true if the number of scheduled tasks $L_{sched}$ is bigger than, or equal 
to, the number of tasks to be sent $L_{send}$. 

Now we will prove that if there exists a schedule whose makespan is less than, or equal to, M, 
Algorithm 3 builds one and returns true. Consider an optimal schedule $\sigma^*$ with a makespan M. 
We will prove that Algorithm 3 will return true. 

Algorithm 3 Algorithm for star-platforms with homogeneous communications and heterogeneous 
workers 

/* Phase 1: Initialization */ 
initialize $f_i$ for all workers i, $f_i = L_i \times w_i$ 
compute R and S, order S by non-decreasing values $c_i$ such that $c_{s_1} \leq c_{s_2} \leq \ldots$ 
/* Phase 2: Preparing the senders */ 
for all $s_i \in S$ do 
    $l_{s_i} \leftarrow \lceil (f_{s_i} - T) / w_{s_i} \rceil$ 
    if $\lfloor T / c_{s_i} \rfloor < l_{s_i}$ then 
        /* M too small */ 
        return (false, $\emptyset$) 
    end if 
end for 
total number of tasks to send: $L_{send} \leftarrow \sum_{s_i} l_{s_i}$ 
/* Phase 3: Preparing the receivers */ 
$D \leftarrow \emptyset$ 
for all $r_i \in R$ do 
    $l_{r_i} \leftarrow 0$ 
    while $f_{r_i} \leq M - (l_{r_i} + 1) \times w_{r_i}$ do 
        $l_{r_i} \leftarrow l_{r_i} + 1$ 
        $d_{r_i}^{(l_{r_i})} \leftarrow M - (l_{r_i} \times w_{r_i})$ 
        $D \leftarrow D \cup (d_{r_i}^{(l_{r_i})}, r_i)$ 
    end while 
end for 
number of tasks that can be received: $L_{recv} \leftarrow \sum_{r_i} l_{r_i}$ 
/* Phase 4: The master schedules */ 
senders send in non-decreasing order of values $c_{s_i}$ to the master 
order deadline-list D by non-decreasing values of deadlines and rename the deadlines in 
    this order from 1 to $L_{recv}$ 
$\sigma \leftarrow \emptyset$; $t \leftarrow c_{s_1}$; $L_{sched} = 0$; 
for i = 1 to $L_{recv}$ do 
    $(d_i, r_i) \leftarrow$ i-th element $(d_{r_k}^{(j)}, r_k)$ of D 
    $\sigma \leftarrow \sigma \cup \{r_i\}$ 
    $t \leftarrow t + c_{r_i}$ 
    $L_{sched} \leftarrow L_{sched} + 1$ 
    if $t > d_i$ then 
        Find $(d_j, r_j)$ in $\sigma$ such that $c_{r_j}$ value is largest 
        $\sigma \leftarrow \sigma \setminus \{(d_j, r_j)\}$ 
        $t \leftarrow t - c_{r_j}$ 
        $L_{sched} \leftarrow L_{sched} - 1$ 
    end if 
end for 
return $((L_{sched} \geq L_{send}), \sigma)$ 

We have computed, for each receiver $r_j$, $l_{r_j}$, the maximal number of tasks $r_j$ can process after 
having finished processing its initial load. Let $N_{r_j}$ denote the number of tasks received by $r_j$ 
in $\sigma^*$, $N_{r_j} \leq l_{r_j}$. For all receivers $r_j$ we know the number $N_{r_j}$ of scheduled tasks. So we have 
$L^*_{sched} = \sum_{r_j} N_{r_j}$. As in an optimal schedule all tasks sent by the senders are processed on 
the receivers, we know that $L^*_{sched} = L^*_{send}$. Let us denote D the set of deadlines computed in 
our algorithm for the scheduling problem of which $\sigma^*$ is an optimal solution. We also define the 
following set $D^* = \bigcup_i \bigcup_{1 \leq j \leq N_{r_i}} (M - j \times w_{r_i}, r_i)$ of the $N_{r_i}$ latest deadlines for each receiver $r_i$. 
Obviously $D^* \subseteq D$. The set of tasks in $\sigma^*$ is exactly a set of tasks that respects the deadlines in 
$D^*$. The application of the algorithm of Moore on the same problem returns a maximal solution 
if there exists a solution. With $D^* \subseteq D$, we already know that there exists a solution with $L^*_{sched}$ 
scheduled tasks. So Moore's algorithm will return a solution with $L_{sched} \geq L^*_{sched}$, as there are 
more possible deadlines. On the other side, we have $L^*_{sched} \geq L_{send}$ as $L_{send}$ is the minimal number 
of tasks that have to be sent to fit in the given makespan. So we get that $L_{sched} \geq L_{send}$. As 
we return true in our algorithm if $L_{sched} \geq L_{send}$, we will return true whenever there exists a 
schedule whose makespan is less than, or equal to, M. 



[Figure 18 labels: receiver $r_j$; computation of initial tasks $L_{r_j}$.]

Figure 18: Number of loads scheduled to receiver $r_j$, in order of its deadlines.



Now we prove that if Algorithm 3 returns true there exists a schedule whose makespan is
less than, or equal to, $M$. Our algorithm returns true if it has found a schedule $\sigma$ where
$L_{sched} \ge L_{send}$. If $L_{sched} = L_{send}$ then the schedule $\sigma$ found by our algorithm is a
schedule whose makespan is less than, or equal to, $M$. If $L_{sched} > L_{send}$, we take the $L_{send}$
first elements of $\sigma$, which still define a schedule whose makespan is less than, or equal to, $M$. □
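
The receiver-side phases of Algorithm 3 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names, the `(f, w)` tuple layout and the `first_arrival` parameter are hypothetical, and communication links are taken to be homogeneous (a single value `c`), as Algorithm 3 requires.

```python
def receiver_deadlines(M, receivers):
    """Phase 3: one deadline per extra task each receiver can still absorb.

    receivers: list of (f, w) pairs (assumed layout), where f is the time at
    which the receiver finishes its initial load and w its time per task.
    Returns a list of (deadline, receiver_index) pairs.
    """
    deadlines = []
    for r, (f, w) in enumerate(receivers):
        l = 0
        # One more task fits if it can start at M - (l + 1) * w, not before f.
        while f <= M - (l + 1) * w:
            l += 1
            deadlines.append((M - l * w, r))
    return deadlines


def moore_schedule(deadlines, c, first_arrival):
    """Phase 4: Moore-style selection of transfers that meet their deadlines.

    c is the homogeneous time to send one task to a receiver and
    first_arrival the time at which the master holds its first task.
    Returns the accepted (deadline, receiver_index) pairs in sending order.
    """
    sigma = []
    t = first_arrival
    for d, r in sorted(deadlines):
        sigma.append((d, r))
        t += c
        if t > d:
            # All transfers last c, so every accepted transfer is a "largest"
            # one; dropping the newest leaves the earlier ones untouched.
            sigma.pop()
            t -= c
    return sigma
```

For the receivers of the trace platform of Section 4.2 ($P_2$, $P_3$, $P_4$ with `(f, w)` equal to `(3, 3)`, `(4, 4)`, `(0, 4)`, and $c = 2$), the sketch accepts four transfers for $M = 13$ and only three for $M = 12$; the count is what Algorithm 3 compares with $L_{send}$.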

Proposition 2. Algorithm 4 returns in polynomial time an optimal schedule $\sigma$ for the following
scheduling problem: minimizing the makespan on a star-platform with homogeneous communication
links and heterogeneous workers where the initial tasks are located on the workers.



Proof. We perform a binary search for a solution in a starting interval of $[\min(f_i), \max(f_i)]$. As
we are in heterogeneous computation conditions, we have heterogeneous $w_i$-values, $1 \le i \le m$,
$w_i \in \mathbb{Q}$. The communications instead are homogeneous, so we have $c_i = c$, $1 \le i \le m$,
$c \in \mathbb{Q}$. Let the representation of the values be of the following form:

$$w_i = \frac{\alpha_i}{\beta_i}, \quad (\alpha_i, \beta_i) \in \mathbb{N} \times \mathbb{N}^*,$$

where $\alpha_i$ and $\beta_i$ are coprime,

$$c_i = c = \frac{\gamma}{\delta}, \quad (\gamma, \delta) \in \mathbb{N} \times \mathbb{N}^*,$$

where $\gamma$ and $\delta$ are coprime.

Let $\lambda$ be the least common multiple of the denominators $\beta_i$ and $\delta$,
$\lambda = \mathrm{lcm}\{\beta_i, \delta\}$, $1 \le i \le m$. As a consequence, for any $i$ in $[1..m]$,
$\lambda \times w_i \in \mathbb{N}$ and $\lambda \times c \in \mathbb{N}$. Now we have to choose the precision
which allows us to stop our binary search. For this, we take a look at the possible finish times of
the workers: all of them are linear combinations of the different $c_i$- and $w_i$-values. So if we multiply
all values by $\lambda$ we get integers for all values and the smallest gap between two finish times is at
least 1. So the precision $p$, i.e., the minimal gap between two feasible finish times, is $p = \frac{1}{\lambda}$.
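
As a small illustration of this choice of precision, the following Python sketch computes $\lambda$ and $p = 1/\lambda$ from rational platform parameters. The function name and the use of `fractions.Fraction` (which stores every value in lowest terms, matching the coprimality assumption) are choices made for this example only.

```python
from fractions import Fraction
from math import lcm  # available from Python 3.9 on


def search_precision(w, c):
    """Return (lam, p): lam is the lcm of all denominators, p = 1 / lam.

    w: list of Fraction processing times w_i; c: Fraction communication time.
    """
    lam = lcm(*(x.denominator for x in w), c.denominator)
    return lam, Fraction(1, lam)


# Toy values, not taken from the report.
w = [Fraction(3, 2), Fraction(5, 4)]
c = Fraction(2, 3)
lam, p = search_precision(w, c)
# Multiplying by lam turns every parameter into an integer.
scaled = [lam * x for x in w] + [lam * c]
```

With these toy values the denominators are 2, 4 and 3, so `lam` is 12 and the binary search may stop once the interval is no wider than 1/12.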
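Algorithm 4 Algorithm to optimize the makespan.

/* idea: make a binary search of $M \in [\min(f_i), \max(f_i)]$ */

input: $w_i = \frac{\alpha_i}{\beta_i}$, $(\alpha_i, \beta_i) \in \mathbb{N} \times \mathbb{N}^*$, $c_i = \frac{\gamma_i}{\delta_i}$, $(\gamma_i, \delta_i) \in \mathbb{N} \times \mathbb{N}^*$

$\lambda \leftarrow \mathrm{lcm}\{\beta_i, \delta_i\}$, $1 \le i \le m$

precision $\leftarrow \frac{1}{\lambda}$

lo $\leftarrow \min(f_i)$; hi $\leftarrow \max(f_i)$;

procedure binary-Search(lo, hi):
    gap $\leftarrow$ |lo - hi|
    while gap > precision do
        $M \leftarrow$ (lo + hi)/2
        found $\leftarrow$ MBBSA($M$)
        if not found then
            /* M is too small */
            lo $\leftarrow M$
        else
            /* M is maybe too big */
            hi $\leftarrow M$
            $\sigma \leftarrow$ found schedule
        end if
        gap $\leftarrow$ |lo - hi|
    end while
    return $\sigma$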
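
The binary search of Algorithm 4 can be sketched as below. The feasibility test is passed in as a callable `feasible(M)` returning `(found, schedule)`; this interface and the function name are assumptions made for the example and stand in for a call to MBBSA.

```python
from fractions import Fraction


def binary_search_makespan(lo, hi, precision, feasible):
    """Shrink [lo, hi] until it is no wider than `precision`.

    feasible(M) is a caller-supplied oracle returning (found, schedule).
    Returns (hi, schedule); schedule is None if no feasible M was probed.
    """
    best = None
    while hi - lo > precision:
        mid = (lo + hi) / 2
        found, schedule = feasible(mid)
        if found:
            # mid is achievable but maybe too large: keep it as upper bound.
            hi, best = mid, schedule
        else:
            # mid is too small.
            lo = mid
    return hi, best


# Toy oracle: pretend every makespan of at least 13 is achievable.
def toy_oracle(M):
    return (M >= 13, ("schedule for", M))


upper, sched = binary_search_makespan(
    Fraction(0), Fraction(24), Fraction(1, 12), toy_oracle)
```

Working with `Fraction` keeps the interval bounds exact, so the stopping test against the precision $1/\lambda$ is not disturbed by rounding.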



Complexity: The maximal number of different values $M$ we have to try can be computed as
follows: we examine our algorithm in the interval $[\min(f_i), \max(f_i)]$. The possible values have an
increment of $\frac{1}{\lambda}$. So there are $(\max(f_i) - \min(f_i)) \times \lambda$ possible values for $M$.

So the complexity of the binary search is $O(\log((\max(f_i) - \min(f_i)) \times \lambda))$. Now we have to
prove that we stay in the order of the size of our problem. Our platform parameters $c$ and $w_i$ are
given in the form $w_i = \frac{\alpha_i}{\beta_i}$ and $c = \frac{\gamma}{\delta}$. So it takes
$\log(\alpha_i) + \log(\beta_i)$ to store a $w_i$ and $\log(\gamma) + \log(\delta)$ to store $c$. So our
entry $E$ has the following size:

$$E = \sum_i \log(\alpha_i) + \sum_i \log(\beta_i) + \log(\gamma) + \log(\delta) + \sum_i \log(L_i)$$

We can do the following estimation:

$$E \ge \sum_i \log(\beta_i) + \log(\delta) = \log\Big(\prod_i \beta_i \times \delta\Big) \ge \log(\lambda)$$

So we already know that our complexity is bounded by $O(|E| + \log(\max(f_i) - \min(f_i)))$. We can
simplify this expression: $O(|E| + \log(\max(f_i) - \min(f_i))) \le O(|E| + \log(\max(f_i)))$. It remains to
upper-bound $\log(\max(f_i))$.

Remember that $\max(f_i)$ is defined as $\max(f_i) = \max_i(L_i \times w_i) = L_{i_0} \times w_{i_0}$. Thus
$\log(\max(f_i)) = \log(L_{i_0}) + \log(w_{i_0})$. $L_{i_0}$ is a part of the input and hence its size can be
upper-bounded by the size of the input $E$. In the same manner we can upper-bound $\log(w_{i_0})$ by
$\log(w_{i_0}) \le \log(\alpha_{i_0}) + \log(\beta_{i_0}) \le E$.

Assembling all these upper bounds, we get $O(\log((\max(f_i) - \min(f_i)) \times \lambda)) \le O(3|E|)$ and
hence our proposed algorithm needs $O(|E|)$ steps for the binary search. The total complexity
finally is $O(|E| \times \max(nm, n^2))$, where $n$ is the number of scheduled tasks and $m$ the number of
workers. □
3.5 Heuristics for heterogeneous platforms

As there exists no optimal algorithm to build a schedule in polynomial runtime (unless P = NP) for
heterogeneous platforms, we propose three heuristics. A comparative study is done in Section 4.

• The first heuristic consists in using BBA, the optimal algorithm for homogeneous platforms
(see Algorithm 1). On heterogeneous platforms, at each step BBA optimizes the local
makespan.

• Another heuristic is to use MBBSA, the optimal algorithm for platforms with homogeneous
communication links (see Algorithm 3). The reason why MBBSA is not optimal on
heterogeneous platforms is the following: Moore's algorithm, which is used for the scheduling
step, cares about the tasks already on the master, but it does not check whether the tasks have
already arrived. The use of homogeneous communication links eliminated this difficulty. We
can observe that in the cases where the overcharged workers (i.e., the senders) communicate
faster than the undercharged workers (i.e., the receivers), MBBSA is also optimal. However,
the problem with this statement is that we do not know a priori which processors will work
as senders. So on heterogeneous platforms the results are optimal whenever the sending
workers turn out to have faster communication links than the receiving ones.

• We propose a third heuristic: the Reversed Binary-Search Algorithm (see Algorithm 5
for details). This algorithm reuses the idea of introducing deadlines. Contrary
to MBBSA, this algorithm traverses the deadlines in reversed order, hence the name.
Starting at a given makespan, R-BSA schedules all tasks as late as possible until no more
task can be scheduled.

R-BSA can be divided into four phases:

Phase 1 is the same as in MBBSA. It decides which of the workers will be senders and
which receivers, depending on the given makespan (see Figure 16).

Phase 2 fixes how many transfers have to be scheduled from each sender such that the
senders all finish their remaining tasks in time. This phase is also identical to MBBSA.

Phase 3 computes for each receiver the time at which it can start computing the
additional tasks; in general this is the given makespan.

Phase 4 again is the proper scheduling step: beginning at the makespan we fill backward
the idle times of the receiving workers. So the master decides which tasks have to be
scheduled on which receivers and in which order. The master chooses a worker that
can start to receive the task as late as possible and still finish it in time.
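
Phases 1 and 2, which R-BSA shares with MBBSA, can be sketched as follows. This is a hedged illustration, not the report's implementation: it assumes that a worker is a sender exactly when its initial finish time $L_i \times w_i$ exceeds $M$, that a sender must ship $\lceil (L_i \times w_i - M) / w_i \rceil$ tasks, and that it can ship at most $\lfloor M / c_i \rfloor$ tasks within the makespan. The function name and return layout are hypothetical.

```python
from math import ceil, floor


def prepare_senders(M, loads, w, c):
    """Split workers into senders/receivers and count the tasks to be sent.

    loads[i], w[i], c[i]: initial task count, time per task and time per
    transfer of worker i. Returns (ok, senders, receivers, to_send), where
    ok is False if some sender cannot ship its surplus within M.
    Pass Fraction values if exact arithmetic is needed.
    """
    ends = [L * wi for L, wi in zip(loads, w)]
    senders = [i for i, e in enumerate(ends) if e > M]
    receivers = [i for i, e in enumerate(ends) if e <= M]
    to_send = {}
    for i in senders:
        need = ceil((ends[i] - M) / w[i])   # tasks that cannot finish by M
        if floor(M / c[i]) < need:          # not enough time to send them
            return False, senders, receivers, {}
        to_send[i] = need
    return True, senders, receivers, to_send
```

On the trace platform of Section 4.2 (loads 8, 1, 1, 0; $w$ = 3, 3, 4, 4; $c = 2$) and $M = 13$, the sketch marks the first worker as the only sender, with four tasks to send.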



4 Simulations 

In this section we present the results of our simulation experiments with the presented algorithms
and heuristics on multiple platforms. We study the heuristics that we presented in Section 3.5.

4.1 The simulations 

All simulations were made with SimGrid [16, 24]. SimGrid is a toolkit that provides several func- 
tionalities for the simulation of distributed applications in heterogeneous distributed environments. 
The toolkit is distributed into several layers and offers several programming environments, such as 
XBT, the core toolbox of SimGrid or SMPI, a library to run MPI applications on top of a virtual 
environment. The access to the different components is ensured via Application Programming 
Interfaces (API). We use the module MSG to create our entities. 
Algorithm 5 Reversed Binary-Search Algorithm

/* Phase 1: Initialization */
$T \leftarrow M$; $L_{sched} \leftarrow 0$; $\sigma \leftarrow \emptyset$
$\forall k$: $L_k^{(0)} \leftarrow L_k$
initialize $end_i$ for all workers $i$: $end_i = L_i \times w_i$
compute $R$ and $S$, order $S$ by non-decreasing values $c_i$: $c_{s_1} \le c_{s_2} \le \dots$
master_in $\leftarrow c_{s_1}$
/* Phase 2: Preparing the senders */
for all $s_i \in S$ do
    $l_{s_i} \leftarrow \lceil \frac{end_{s_i} - T}{w_{s_i}} \rceil$
    if $\lfloor \frac{T}{c_{s_i}} \rfloor < l_{s_i}$ then
        /* M too small */
        return (false, $\emptyset$)
    end if
end for
total number of tasks to be sent: $L_{send} \leftarrow \sum_{s_i} l_{s_i}$
/* Phase 3: Determination of the last deadline */
for all $r_i \in R$ do
    if $end_{r_i} \le T$ then
        $begin_{r_i} \leftarrow T$
    end if
end for
/* Phase 4: The scheduling */
while true do
    choose receiver such that it is the worker that can start receiving it as late as possible, i.e.,
    $(\min(begin_i - w_i, T)) - c_i$ is maximal, and that the schedule is feasible: the task must
    fit in the idle gap of the worker ($begin_{receiver} - w_{receiver} \ge end_{receiver}$) and the task has
    to have arrived at the master ($begin_{receiver} - w_{receiver} - c_{receiver} \ge$ master_in).
    if no receiver found then
        return $((L_{sched} \ge L_{send}), \sigma)$
    end if
    $begin_{receiver} \leftarrow begin_{receiver} - w_{receiver}$
    $T \leftarrow begin_{receiver} - c_{receiver}$
    $L_{sched} \leftarrow L_{sched} + 1$
    $\sigma \leftarrow \sigma \cup \{receiver\}$
    $i \leftarrow i + 1$
end while
The simulations were made on automatically created random platforms of four types: we
analyze the behavior on fully homogeneous and fully heterogeneous platforms and the mixture of
both, i.e., platforms with homogeneous communication links and heterogeneous workers and the
converse. For every platform type 1000 instances were created with the following characteristics:
in absolute random platforms, the random values for $c_i$ and $w_i$ vary between 1 and 100, whereas
the number of tasks is at least 50. In another test series we put constraints on the
communication and computation powers. In the first one, we decide the communication power
to be inferior to the computation power. In this case the values for the communication power
vary between 20 and 50 and the computation powers can take values between 50 and 80. In the
opposite case, where communication power is supposed to be superior to the computation power,
these ranges are swapped.



4.2 Trace tests 

To verify the correct behavior of the algorithms, we made some trace tests. The visualizations of
the runs on a small test platform are shown in this section.

We use a small platform with homogeneous communication links, $c = 2$, so the bandwidth is
0.5. We use four heterogeneous workers with the following $w$-values: $P_1$ and $P_2$ compute faster,
so we set $w_1 = w_2 = 3$. Workers $P_3$ and $P_4$ are slower, with $w_3 = w_4 = 4$. $P_1$ owns 8 tasks
at the beginning, $P_2$ and $P_3$ one task each, whereas worker $P_4$ has no initial work. The
optimal makespan is $M = 13$, as we computed by enumerating all possible schedules.

In the following figures, computations are indicated in black. White rectangles denote internal
blockings of SimGrid in the communication process of a worker. These blockings appear
when communication processes notice that the current message is not destined for them. Grey
rectangles represent idle time in the computation process. Finally, the light grey fields show the
communication processes between the processors.

The schedule of BBA can be seen in Figure 19. Evidently the worker with the latest finish time
is $P_1$; worker $P_2$ can finish the first sent task earlier than workers $P_3$ and $P_4$, so it is the receiver
for the first task. In this solution, worker $P_1$ sends four tasks, which are received by $P_2$, $P_4$, $P_2$
and once again $P_4$. The makespan is 14, so the schedule is not optimal. This does not contradict
our theoretical results, as we proved optimality of BBA only on homogeneous platforms.




Figure 19: Trace of the simulation of BBA. 



MBBSA achieves, as expected, the optimal makespan of 13 (see Figure 20). As you can see by
comparing Figures 19 and 20, the second task scheduled by MBBSA (to worker $P_2$) finishes
processing later than in the schedule of BBA. So MBBSA, while globally optimal, does not minimize
the completion time of each task.

R-BSA also finds an optimal schedule (cf. Figure 21). Even in this small test the difference between
R-BSA and MBBSA is noticeable: R-BSA tries to schedule as many tasks as possible by filling
idle times starting at the makespan. MBBSA, in contrast, tries to schedule tasks as soon as possible
before their deadlines expire.
Figure 20: Trace of the simulation of MBBSA.

Figure 21: Trace of the simulation of R-BSA.



4.3 Distance from the best 

We made a series of distance tests to get some information on the mean quality of our algorithms.
For this purpose we ran all algorithms on 1000 different random platforms of each type, i.e.,
homogeneous and heterogeneous, as well as homogeneous communication links with heterogeneous
workers and the converse. We normalized the measured schedule makespans over the best result
for a given instance. In the following figures we plot the accumulated number of platforms that
have a normalized distance less than the indicated distance. This means we count on how many
platforms a certain algorithm achieves results that do not differ by more than X% from the best
schedule. For example, in Figure 22(b) the third point of the R-BSA line signifies that about
93% of the schedules of R-BSA differ by less than 3% from the best schedule.
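
The normalization and the accumulated counts described above can be sketched as follows. The function names, the dictionary layout and the makespan values are invented for this illustration; they are not simulation results from the report.

```python
def normalized_distances(makespans):
    """makespans: dict algorithm -> list of makespans, one per platform.

    Each makespan is divided by the best makespan obtained by any
    algorithm on the same platform, so the best algorithm gets 1.0.
    """
    n = len(next(iter(makespans.values())))
    best = [min(runs[p] for runs in makespans.values()) for p in range(n)]
    return {algo: [runs[p] / best[p] for p in range(n)]
            for algo, runs in makespans.items()}


def fraction_within(distances, percent):
    """Share of platforms whose distance is at most `percent` % above the best."""
    limit = 1 + percent / 100
    return sum(d <= limit for d in distances) / len(distances)


# Toy data for three platforms (made-up numbers).
toy = {"BBA": [14, 10, 22], "MBBSA": [13, 10, 20], "R-BSA": [13, 11, 20]}
dist = normalized_distances(toy)
```

Evaluating `fraction_within(dist[algo], x)` for increasing `x` gives one point per value of `x` on the curves of Figures 22 to 25.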

Our results on homogeneous platforms can be seen in Figure 22. As expected from the
theoretical results, BBA and MBBSA achieve the same results and behave equally well on all
platforms. R-BSA in contrast is sensitive to the platform characteristics. When the
communication power is less than the computation power, i.e., the $c_i$-values are bigger, R-BSA
behaves as well as MBBSA and BBA. But in the case of small $c_i$-values or on homogeneous
platforms without constraints on the power rates, R-BSA achieves worse results.

The simulation results on platforms with homogeneous communication links and heterogeneous
computation powers (cf. Figure 23) consolidate the theoretical predictions: independently of the
platform parameters, MBBSA always obtains optimal results, while BBA differs slightly when high
precision is demanded. The behavior of R-BSA strongly depends on the platform parameters:
when communications are slower than computations, it achieves good results.

On platforms with heterogeneous communication links and homogeneous workers, BBA has
by far the poorest results, whereas R-BSA behaves well (see Figure 24). In general it
outperforms MBBSA, but when the communication links are fast, MBBSA is the best.

The results on heterogeneous platforms are equivalent to those on platforms with heterogeneous
communication links and homogeneous workers, as can be seen in Figure 25. R-BSA seems to be
a good candidate, whereas BBA is to be avoided as the gap reaches more than 40%.
(a) Homogeneous platform (general case).
(b) Homogeneous platform, faster communicating.
(c) Homogeneous platform, faster computing.

Figure 22: Frequency of the distance to the best on homogeneous platforms.

(a) General platform.
(b) Faster communicating.
(c) Faster computing.

Figure 23: Frequency of the distance to the best on platforms with homogeneous communication
links and heterogeneous computation power.

(a) General platform.
(b) Faster communicating.
(c) Faster computing.

Figure 24: Frequency of the distance to the best on platforms with heterogeneous communication
links and homogeneous computation power.

(a) Heterogeneous platform (general case).
(b) Heterogeneous platform, faster communicating.
(c) Heterogeneous platform, faster computing.

Figure 25: Frequency of the distance to the best on heterogeneous platforms.
4.4 Mean distance and standard deviation

We also computed for every algorithm the mean distance from the best on each platform type.
These calculations are based on the simulation results on the 1000 random platforms of Section 4.3.
As you can see in Table 1, in general MBBSA achieves the best results. On homogeneous platforms
BBA behaves just as well as MBBSA, and on platforms with homogeneous communication links
it also performs as well. When communication links are heterogeneous and there is no knowledge
about platform parameters, R-BSA outperforms the other algorithms and BBA is by far the worst
choice.



[Table 1: body not recoverable from the extracted text. Surviving fragments: header "Platform type"; row labels $c < w$ and $c > w$; values 1.0189, 0.0127, 1.0261, 1.0046, 0.0384, 0.0118, 0.0121.]



Table 1: Mean distance from the best and standard deviation of the different algorithms on each 
platform type. 

The standard deviations of all algorithms over the 1000 platforms are shown in the right part
of Table 1. These values lead to exactly the same conclusions as the mean distances
in the left part, so we do not comment on them in detail. We only want to point out that
the standard deviation of MBBSA always stays small, whereas in the case of heterogeneous
communication links the BBA heuristic cannot be recommended.

5 Load balancing of divisible loads using the multiport switch-model

5.1 Framework 

In this section we work with a heterogeneous star network. But in contrast to Section 3 we
replace the master by a switch. So we have $m$ workers which are interconnected by a switch and
$m$ heterogeneous links. Link $i$ is the link that connects worker $P_i$ to the switch. Its bandwidth is
denoted by $b_i$. In the same way $s_i$ denotes the computation speed of worker $P_i$. Every worker
$P_i$ possesses an amount of initial load $\alpha_i$. Contrary to the previous section, this load is not
considered to consist of identical and independent tasks but of divisible loads. This means that
an amount of load $X$ can be divided into an arbitrary number of tasks of arbitrary size. As
already mentioned, this approach is called Divisible Load Theory - DLT [4]. The communication
model used in this case is an overlapped unbounded switched-multiport model. This means all
communications pass through a centralized switch that has no throughput limitations. So all workers
can communicate at the same time and a given worker can start executing as soon as it receives
the first bit of data. As we use a model with overlap, communication and computation can take
place at the same time.

As in the previous section our objective is to balance the load over the participating workers
to minimize the global makespan $M$.
5.2 Redistribution strategy

Let $\sigma$ be a solution of our problem that takes a time $T$. In this solution, there is a set of sending
workers $S$ and a set of receiving workers $R$. Let $send_i$ denote the amount of load sent by sender
$P_i$ and $recv_j$ be the amount of load received by receiver $P_j$, with $send_i \ge 0$, $recv_j \ge 0$. As all load
that is sent has to be received by another worker, we have the following equation:

$$\sum_{i \in S} send_i = \sum_{j \in R} recv_j = L. \qquad (1)$$

In the following we describe the properties of the senders. As the solution $\sigma$ takes a time $T$, the
amount of load a sender can send depends on its bandwidth, so it is bounded by the time slot of
length $T$:

$$\forall \text{ sender } i \in S, \quad \frac{send_i}{b_i} \le T. \qquad (2)$$

Besides, it has to send at least the amount of load that it cannot finish processing in time $T$.
This lower bound can be expressed by

$$\forall \text{ sender } i \in S, \quad send_i \ge \alpha_i - T \times s_i. \qquad (3)$$

The properties for receiving workers are similar. The amount of load a worker can receive
depends on its bandwidth. So we have:

$$\forall \text{ receiver } j \in R, \quad \frac{recv_j}{b_j} \le T. \qquad (4)$$

Additionally it depends on the amount of load it already possesses and on its computation
speed. It must have the time to process all its load, the initial one plus the received one. That is
why we have a second upper bound:

$$\forall \text{ receiver } j \in R, \quad \frac{\alpha_j + recv_j}{s_j} \le T. \qquad (5)$$

For the rest of our paper we introduce a new notation: let $\delta_i$ denote the imbalance of a worker.
We define it as follows:

$$\delta_i = \begin{cases} send_i & \text{if } i \in S \\ -recv_i & \text{if } i \in R \end{cases}$$

With the help of this new notation we can re-characterize the imbalance of all workers:

• This imbalance is bounded by

$$|\delta_i| \le b_i \times T.$$

- If $i \in S$, worker $P_i$ is a sender, and this statement is true because of inequality 2.

- If $i \in R$, worker $P_i$ is a receiver and the statement is true as well, because of inequality 4.

• Furthermore, we lower-bound the imbalance of a worker by

$$\delta_i \ge \alpha_i - T \times s_i. \qquad (6)$$

- If $i \in S$, we are in the case where $\delta_i = send_i$ and hence this is true because of equation 3.

- If $i \in R$, we have $\delta_i = -recv_i \le 0$. Hence we get that (6) is equal to $-recv_i \ge \alpha_i - T \times s_i$,
which in turn is equivalent to (5).

• Finally we know as well that $\sum_i \delta_i = 0$ because of equation 1.
If we combine all these constraints we get the following linear program (LP), with the addition
of our objective to minimize the makespan $T$. This combination of all properties into an LP is
possible because we can use the same constraints for senders and receivers. As you may have
noticed, a worker acts as a sender if its imbalance $\delta_i$ is positive, receivers
being characterized by negative $\delta_i$-values.

Minimize $T$,

under the constraints
(7a) $|\delta_i| \le T \times b_i$

(7b) $\delta_i \ge \alpha_i - T \times s_i$

(7c) $\sum_i \delta_i = 0$

All the constraints of the LP are satisfied for the $(\delta_i, T)$-values of any schedule solution of the
initial problem. We call $T_0$ the solution of the LP for a given problem. As the LP minimizes the
time $T$, we have $T_0 \le T$ for all valid schedules and hence we have found a lower bound for the
optimal makespan.
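
To make the linear program more concrete, here is a small numerical sketch. It deliberately does not call an LP solver: it swaps in a bisection on $T$, which works for this particular program because, for a fixed $T$, constraints (7a) and (7b) reduce to one interval per $\delta_i$ and (7c) only asks whether 0 lies between the sums of the interval bounds. The function name, the argument names and the way the slack is spread over the workers are assumptions of this example.

```python
def min_makespan(alpha, s, b, iterations=100):
    """Approximate T0 of LP (7) and return one feasible imbalance vector.

    alpha, s, b: initial loads, speeds and bandwidths (speeds must be > 0).
    """
    def lower(T):
        # Smallest delta_i allowed by (7a) and (7b) for this T.
        return [max(-T * bi, ai - T * si) for ai, si, bi in zip(alpha, s, b)]

    def feasible(T):
        lo = lower(T)
        boxes_ok = all(l <= T * bi for l, bi in zip(lo, b))
        # The upper bounds T * b_i sum to >= 0, so (7c) only needs sum(lo) <= 0.
        return boxes_ok and sum(lo) <= 0

    lo_T, hi_T = 0.0, max(ai / si for ai, si in zip(alpha, s))
    for _ in range(iterations):
        mid = (lo_T + hi_T) / 2
        if feasible(mid):
            hi_T = mid
        else:
            lo_T = mid
    T0 = hi_T
    lo = lower(T0)
    hi = [T0 * bi for bi in b]
    slack = -sum(lo)                      # amount still to distribute
    room = sum(h - l for h, l in zip(hi, lo))
    share = slack / room if room > 0 else 0.0
    delta = [l + share * (h - l) for l, h in zip(lo, hi)]
    return T0, delta
```

For two workers with $\alpha = (10, 0)$, unit speeds and large bandwidths, the sketch returns $T_0 \approx 5$ with $\delta \approx (5, -5)$; with bandwidths of 0.5 the links become the bottleneck and $T_0 \approx 20/3$.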

Now we prove that we can find a feasible schedule with makespan $T_0$. We start from an
optimal solution of the LP, i.e., $T_0$ and the $\delta_i$-values computed by some LP solver, such as Maple
or MuPAD. With the help of these values we are able to describe the schedule:

1. Every sender $i$ sends a fraction of load to each receiver $j$. We decide that each sender sends
to each receiver a share of its load proportional to what this receiver receives in total. We denote by

$$f_{i,j} = \delta_i \times \frac{\delta_j}{\sum_{k \in R} \delta_k} = \delta_i \times \frac{\delta_j}{-L} \qquad (8)$$

the fraction of load that a sender $P_i$ sends to a receiver $P_j$. In other words we have
$f_{i,j} = \delta_i \times \frac{-recv_j}{-L}$.

2. During the whole schedule we use constant communication rates, i.e., worker $j$ will receive
its fraction of load $f_{i,j}$ from sender $i$ with a fixed receiving rate, which is denoted by $\lambda_{i,j}$:

$$\lambda_{i,j} = \frac{f_{i,j}}{T_0} \qquad (9)$$

3. A schedule starts at time $t = 0$ and ends at time $t = T_0$.
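
The three rules above can be turned into a few lines of Python. The sketch assumes that the imbalances `delta` and the makespan `T0` are already known (for instance from an LP solver); the function name and the dictionary layout are hypothetical.

```python
def transfer_plan(delta, T0):
    """Build the amounts f[i, j] of equation (8) and the rates of equation (9).

    delta: imbalance per worker (positive for senders, negative for receivers).
    Returns (f, rate), two dicts keyed by (sender, receiver) index pairs.
    """
    senders = [i for i, d in enumerate(delta) if d > 0]
    receivers = [j for j, d in enumerate(delta) if d < 0]
    L = sum(delta[i] for i in senders)          # total load exchanged
    f = {(i, j): delta[i] * (-delta[j]) / L
         for i in senders for j in receivers}
    rate = {pair: amount / T0 for pair, amount in f.items()}
    return f, rate


# Toy imbalances (made-up values): two senders, two receivers.
f, rate = transfer_plan([6, 2, -5, -3], T0=4)
```

In this toy case sender 0 ships 3.75 units to worker 2 and 2.25 units to worker 3, which adds up to its imbalance of 6; each receiver likewise collects exactly its own deficit.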

We have to verify that each sender can send its amount of load in time $T_0$ and that the receivers
can receive it as well and compute it afterwards.

Let us take a look at a sender $P_i$: the total amount it will send is
$\sum_{j \in R} f_{i,j} = \sum_{j \in R} \delta_i \times \frac{\delta_j}{-L} = \delta_i = send_i$ and as we
started from a solution of our LP, $\delta_i$ respects equations 7a and 7b, thus $send_i$
respects the constraints 2 and 3 as well, i.e., $send_i \le T \times b_i$ and $send_i \ge \alpha_i - T \times s_i$.

Now we consider a receiver $P_j$: the total amount it will receive is
$\sum_{i \in S} f_{i,j} = \sum_{i \in S} \delta_i \times \frac{\delta_j}{-L} = -\delta_j = recv_j$. Worker $P_j$
can receive the whole amount $recv_j$ of load in time $T_0$ as it starts the
reception at time $t = 0$ and $recv_j$ respects constraints 7a and 7b, which in turn imply the initial
constraints 4 and 5, i.e., $recv_j \le T \times b_j$ and $recv_j \le T \times s_j - \alpha_j$. Now we examine whether
worker $P_j$ can finish computing all its work in time. As we use the divisible load model, worker $P_j$ can
start computing its additional amount of load as soon as it has received its first bit, provided
the computing rate does not exceed the receiving rate. Figure 26 illustrates the computing process
of a receiver. There are two possible schedules. In the first one, the worker allocates a certain
percentage of its computing power to each stream of loads and processes them in parallel. This is
shown in Figure 26(a). Processor $P_j$ immediately starts processing all incoming load. For doing so, every
stream is allocated a certain computing rate $\gamma_{i,j}$, where $i$ is the sending worker and $j$ the receiver.
We have to verify that the computing rate is less than or equal to the receiving rate.
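The initial load $\alpha_j$ of receiver $P_j$ needs at least a computing rate such that it finishes right
in time $T_0$: $\gamma_j = \frac{\alpha_j}{T_0}$. The computing rate $\gamma_{i,j}$, for all pairs $(i, j)$, $i \in S$,
$j \in R$, has to verify the following constraints:

• The sum of all computing rates does not exceed the computing power $s_j$ of the worker $P_j$:

$$\Big(\sum_{i \in S} \gamma_{i,j}\Big) + \frac{\alpha_j}{T_0} \le s_j, \qquad (10)$$

• The computing rate for the amount of load $f_{i,j}$ has to be sufficiently big to finish in time $T_0$:

$$\gamma_{i,j} \ge \frac{f_{i,j}}{T_0}, \qquad (11)$$

• The computing rate has to be less than or equal to the receiving rate of the amount $f_{i,j}$:

$$\gamma_{i,j} \le \lambda_{i,j}. \qquad (12)$$

Now we prove that $\gamma_{i,j} = \frac{f_{i,j}}{T_0}$ is a valid solution that respects constraints (10), (11), and (12):

Equation (10) We have
$\big(\sum_{i \in S} \gamma_{i,j}\big) + \frac{\alpha_j}{T_0} = \big(\sum_{i \in S} \frac{f_{i,j}}{T_0}\big) + \frac{\alpha_j}{T_0} = \frac{-\delta_j}{T_0} + \frac{\alpha_j}{T_0} = \frac{\alpha_j - \delta_j}{T_0}$.
Transforming Equation (7b) into $\alpha_j - \delta_j \le T_0 \times s_j$ and using this upper bound we get
$\frac{\alpha_j - \delta_j}{T_0} \le \frac{T_0 \times s_j}{T_0} = s_j$. Hence this constraint holds true.

Equation (11) By definition of $\gamma_{i,j}$ this holds true.

Equation (12) By the definitions of $\gamma_{i,j}$ and $\lambda_{i,j}$ this holds true.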
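
The three constraints can also be checked numerically for one receiver. The sketch below is an illustration only: the function name, the argument layout and the tolerance `tol` are assumptions, and the computing rates are set to $f_{i,j} / T_0$ as in the proof.

```python
def check_receiver(alpha_j, s_j, incoming, T0, tol=1e-9):
    """Check constraints (10), (11) and (12) for a single receiver P_j.

    incoming: list of amounts f_ij received by P_j, one entry per sender.
    """
    gamma = [f / T0 for f in incoming]   # chosen computing rates
    lam = [f / T0 for f in incoming]     # receiving rates, equation (9)
    c10 = sum(gamma) + alpha_j / T0 <= s_j + tol
    c11 = all(g * T0 >= f - tol for g, f in zip(gamma, incoming))
    c12 = all(g <= l + tol for g, l in zip(gamma, lam))
    return c10 and c11 and c12


# Toy receiver (made-up values): initial load 4, two incoming streams.
fits = check_receiver(alpha_j=4, s_j=3, incoming=[3.75, 1.25], T0=4)
too_slow = check_receiver(alpha_j=4, s_j=2, incoming=[3.75, 1.25], T0=4)
```

With speed 3 the receiver needs a total rate of 2.25 and the check passes; with speed 2 constraint (10) is violated, which corresponds to imbalances that would not satisfy (7b).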

In the other possible schedule, all incoming load streams are processed in parallel after having
processed the initial amount of load as shown in Figure 26(b). In fact, this modeling is equivalent
to the previous one, because we use the DLT paradigm. We used this model in equations 3 and 5.



(a) Parallel processing.

(b) Sequential and parallel processing.

Figure 26: Different schedules to process the received load.

The following theorem summarizes our findings:



Theorem 4. The combination of the linear program 7 with equations 8 and 9 returns an optimal 
solution for makespan minimization of a load balancing problem on a heterogeneous star platform 
using the switch model and initial loads on the workers. 
6 Conclusion

In this report we were interested in the problem of scheduling and redistributing data on
master-slave platforms. We considered two types of data models.

Supposing independent and identical tasks, we were able to prove the NP-completeness in the
strong sense for the general case of completely heterogeneous platforms. Therefore we restricted
ourselves to presenting three heuristics for this case. We have also proved that our problem is
polynomial when computations are negligible. Treating some special topologies, we were able to present
optimal algorithms for totally homogeneous star-networks and for platforms with homogeneous
communication links and heterogeneous workers. Both algorithms required a rather complicated
proof.

The simulation experiments consolidate our theoretical results of optimality. On homogeneous
platforms, BBA is to be preferred over MBBSA, as its complexity is remarkably lower. The tests on
heterogeneous platforms show that BBA performs rather poorly in comparison to MBBSA and
R-BSA. MBBSA in general achieves the best results, but it might be outperformed by R-BSA when
platform parameters have a certain constellation, i.e., when workers compute faster than they
communicate.

Dealing with divisible loads as data model, we were able to solve the fully heterogeneous
problem. We presented the combination of a linear program with simple computation formulas to
compute the imbalance in a first step and the corresponding schedule in a second step.

A natural extension of this work would be the following: for the model with independent tasks,
it would be nice to derive approximation algorithms, i.e., heuristics whose worst case is guaranteed
to be within a certain factor of the optimal, for the fully heterogeneous case. However, it is often the
case in scheduling problems for heterogeneous platforms that approximation ratios contain the 
quotient of the largest platform parameter by the smallest one, thereby leading to very pessimistic 
results in practical situations. 

More generally, much work remains to be done along the same lines of load-balancing and 
redistributing while computation goes on. We can envision dynamic master-slave platforms whose 
characteristics vary over time, or even where new resources are enrolled temporarily in the execu- 
tion. We can also deal with more complex interconnection networks, allowing slaves to circumvent 
the master and exchange data directly. 